Redcard Exploratory Data Analysis

This dataset is taken from a fantastic paper that looks to see how analytical choices made by different data science teams on the same dataset in an attempt to answer the same research question affect the final outcome.

Many analysts, one dataset: Making transparent how variations in analytical choices affect results

The data can be found here.

The Task

Do an Exploratory Data Analysis on the redcard dataset. Keeping in mind the question is the following: Are soccer referees more likely to give red cards to dark-skin-toned players than light-skin-toned players?



In [1]:

    
!conda install -c conda-forge pandas-profiling -y 

!pip install missingno









    



Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /home/nbcommon/anaconda3_410:

The following NEW packages will be INSTALLED:

    pandas-profiling: 1.3.0-py35_0  conda-forge

The following packages will be SUPERSEDED by a higher-priority channel:

    conda:            4.3.17-py35_0             --> 4.2.13-py35_0 conda-forge
    conda-env:        2.6.0-0                   --> 2.6.0-0       conda-forge

conda-env-2.6. 100% |################################| Time: 0:00:00   1.17 MB/s
conda-4.2.13-p 100% |################################| Time: 0:00:00   1.44 MB/s
pandas-profili 100% |################################| Time: 0:00:00 663.79 kB/s
Collecting missingno
  Downloading missingno-0.3.5.tar.gz
Requirement already satisfied (use --upgrade to upgrade): numpy in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from missingno)
Requirement already satisfied (use --upgrade to upgrade): matplotlib in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from missingno)
Requirement already satisfied (use --upgrade to upgrade): scipy in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from missingno)
Requirement already satisfied (use --upgrade to upgrade): seaborn in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from missingno)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from matplotlib->missingno)
Requirement already satisfied (use --upgrade to upgrade): pytz in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from matplotlib->missingno)
Requirement already satisfied (use --upgrade to upgrade): cycler in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from matplotlib->missingno)
Requirement already satisfied (use --upgrade to upgrade): pyparsing!=2.0.4,>=1.5.6 in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from matplotlib->missingno)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from python-dateutil->matplotlib->missingno)
Building wheels for collected packages: missingno
  Running setup.py bdist_wheel for missingno ... - \ done
  Stored in directory: /home/nbuser/.cache/pip/wheels/f3/9f/31/38d2fad2bd1ac3ac70a2d159c61515de5825296429a4f13056
Successfully built missingno
Installing collected packages: missingno
Successfully installed missingno-0.3.5
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.



In [2]:

    
%matplotlib inline
%config InlineBackend.figure_format='retina'

from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)

import missingno as msno
import pandas_profiling

from sklearn.datasets import make_blobs
import time









    



/home/nbuser/anaconda3_410/lib/python3.5/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
/home/nbuser/anaconda3_410/lib/python3.5/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

About the Data

The dataset is available as a list with 146,028 dyads of players and referees and includes details from players, details from referees and details regarding the interactions of player-referees. A summary of the variables of interest can be seen below. A detailed description of all variables included can be seen in the README file on the project website.

From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2,053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3,147) that these players played under in their professional career (see Figure 1). We created a dataset of playerâreferee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.

-- https://docs.google.com/document/d/1uCF5wmbcL90qvrk_J27fWAvDcDNrO9o_APkicwRkOKc/edit

Variable Name:	Variable Description:
playerShort	short player ID
player	player name
club	player club
leagueCountry	country of player club (England, Germany, France, and Spain)
height	player height (in cm)
weight	player weight (in kg)
position	player position
games	number of games in the player-referee dyad
goals	number of goals in the player-referee dyad
yellowCards	number of yellow cards player received from the referee
yellowReds	number of yellow-red cards player received from the referee
redCards	number of red cards player received from the referee
photoID	ID of player photo (if available)
rater1	skin rating of photo by rater 1
rater2	skin rating of photo by rater 2
refNum	unique referee ID number (referee name removed for anonymizing purposes)
refCountry	unique referee country ID number
meanIAT	mean implicit bias score (using the race IAT) for referee country
nIAT	sample size for race IAT in that particular country
seIAT	standard error for mean estimate of race IAT
meanExp	mean explicit bias score (using a racial thermometer task) for referee country
nExp	sample size for explicit bias in that particular country
seExp	standard error for mean estimate of explicit bias measure

What the teams found

Choices in model features

The following is the covariates chosen for the respective models:

Choices in modeling

Of the many choices made by the team, here is a small selection of the models used to answer this question:

Final Results

0 teams: negative effect
9 teams: no significant relationship
20 teams: finding a positive effect

Above image from: http://fivethirtyeight.com/features/science-isnt-broken/#part2

â¦selecting randomly from the present teams, there would have been a 69% probability of reporting a positive result and a 31% probability of reporting a null effect. This raises the possibility that many research projects contain hidden uncertainty due to the wide range of analytic choices available to the researchers. -- Silberzahn, R., Uhlmann, E. L., Martin, D. P., Pasquale, Aust, F., Awtrey, E. C., â¦ Nosek, B. A. (2015, August 20). Many analysts, one dataset: Making transparent how variations in analytical choices affect results. Retrieved from osf.io/gvm2z

Images and data from: Silberzahn, R., Uhlmann, E. L., Martin, D. P., Pasquale, Aust, F., Awtrey, E. C., â¦ Nosek, B. A. (2015, August 20). Many analysts, one dataset: Making transparent how variations in analytical choices affect results. Retrieved from osf.io/gvm2z

General tips

Before plotting/joining/doing something, have a question or hypothesis that you want to investigate
Draw a plot of what you want to see on paper to sketch the idea
Write it down, then make the plan on how to get there
How do you know you aren't fooling yourself
What else can I check if this is actually true?
What evidence could there be that it's wrong?



In [3]:

    
# Uncomment one of the following lines and run the cell:

# df = pd.read_csv("../data/redcard/redcard.csv.gz", compression='gzip')
df = pd.read_csv("https://github.com/cmawer/pycon-2017-eda-tutorial/raw/master/data/redcard/redcard.csv.gz",
                 compression='gzip')



In [4]:

    
df.shape









    Out[4]:





(146028, 28)



In [5]:

    
df.head()









    Out[5]:






  
    
      
      playerShort
      player
      club
      leagueCountry
      birthday
      height
      weight
      position
      games
      victories
      ...
      rater2
      refNum
      refCountry
      Alpha_3
      meanIAT
      nIAT
      seIAT
      meanExp
      nExp
      seExp
    
  
  
    
      0
      lucas-wilchez
      Lucas Wilchez
      Real Zaragoza
      Spain
      31.08.1983
      177.0
      72.0
      Attacking Midfielder
      1
      0
      ...
      0.50
      1
      1
      GRC
      0.326391
      712.0
      0.000564
      0.396000
      750.0
      0.002696
    
    
      1
      john-utaka
      John Utaka
      Montpellier HSC
      France
      08.01.1982
      179.0
      82.0
      Right Winger
      1
      0
      ...
      0.75
      2
      2
      ZMB
      0.203375
      40.0
      0.010875
      -0.204082
      49.0
      0.061504
    
    
      2
      abdon-prats
      AbdÃ³n Prats
      RCD Mallorca
      Spain
      17.12.1992
      181.0
      79.0
      NaN
      1
      0
      ...
      NaN
      3
      3
      ESP
      0.369894
      1785.0
      0.000229
      0.588297
      1897.0
      0.001002
    
    
      3
      pablo-mari
      Pablo MarÃ
      RCD Mallorca
      Spain
      31.08.1993
      191.0
      87.0
      Center Back
      1
      1
      ...
      NaN
      3
      3
      ESP
      0.369894
      1785.0
      0.000229
      0.588297
      1897.0
      0.001002
    
    
      4
      ruben-pena
      RubÃ©n PeÃ±a
      Real Valladolid
      Spain
      18.07.1991
      172.0
      70.0
      Right Midfielder
      1
      1
      ...
      NaN
      3
      3
      ESP
      0.369894
      1785.0
      0.000229
      0.588297
      1897.0
      0.001002
    
  

5 rows Ã 28 columns



In [6]:

    
df.describe().T









    Out[6]:






  
    
      
      count
      mean
      std
      min
      25%
      50%
      75%
      max
    
  
  
    
      height
      145765.0
      181.935938
      6.738726
      1.610000e+02
      177.000000
      182.000000
      187.000000
      2.030000e+02
    
    
      weight
      143785.0
      76.075662
      7.140906
      5.400000e+01
      71.000000
      76.000000
      81.000000
      1.000000e+02
    
    
      games
      146028.0
      2.921166
      3.413633
      1.000000e+00
      1.000000
      2.000000
      3.000000
      4.700000e+01
    
    
      victories
      146028.0
      1.278344
      1.790725
      0.000000e+00
      0.000000
      1.000000
      2.000000
      2.900000e+01
    
    
      ties
      146028.0
      0.708241
      1.116793
      0.000000e+00
      0.000000
      0.000000
      1.000000
      1.400000e+01
    
    
      defeats
      146028.0
      0.934581
      1.383059
      0.000000e+00
      0.000000
      1.000000
      1.000000
      1.800000e+01
    
    
      goals
      146028.0
      0.338058
      0.906481
      0.000000e+00
      0.000000
      0.000000
      0.000000
      2.300000e+01
    
    
      yellowCards
      146028.0
      0.385364
      0.795333
      0.000000e+00
      0.000000
      0.000000
      1.000000
      1.400000e+01
    
    
      yellowReds
      146028.0
      0.011381
      0.107931
      0.000000e+00
      0.000000
      0.000000
      0.000000
      3.000000e+00
    
    
      redCards
      146028.0
      0.012559
      0.112889
      0.000000e+00
      0.000000
      0.000000
      0.000000
      2.000000e+00
    
    
      rater1
      124621.0
      0.264255
      0.295382
      0.000000e+00
      0.000000
      0.250000
      0.250000
      1.000000e+00
    
    
      rater2
      124621.0
      0.302862
      0.293020
      0.000000e+00
      0.000000
      0.250000
      0.500000
      1.000000e+00
    
    
      refNum
      146028.0
      1534.827444
      918.736625
      1.000000e+00
      641.000000
      1604.000000
      2345.000000
      3.147000e+03
    
    
      refCountry
      146028.0
      29.642842
      27.496189
      1.000000e+00
      7.000000
      21.000000
      44.000000
      1.610000e+02
    
    
      meanIAT
      145865.0
      0.346276
      0.032246
      -4.725423e-02
      0.334684
      0.336628
      0.369894
      5.737933e-01
    
    
      nIAT
      145865.0
      19697.411216
      127126.197143
      2.000000e+00
      1785.000000
      2882.000000
      7749.000000
      1.975803e+06
    
    
      seIAT
      145865.0
      0.000631
      0.004736
      2.235373e-07
      0.000055
      0.000151
      0.000229
      2.862871e-01
    
    
      meanExp
      145865.0
      0.452026
      0.217469
      -1.375000e+00
      0.336101
      0.356446
      0.588297
      1.800000e+00
    
    
      nExp
      145865.0
      20440.233860
      130615.745103
      2.000000e+00
      1897.000000
      3011.000000
      7974.000000
      2.029548e+06
    
    
      seExp
      145865.0
      0.002994
      0.019723
      1.043334e-06
      0.000225
      0.000586
      0.001002
      1.060660e+00



In [7]:

    
df.dtypes









    Out[7]:





playerShort       object
player            object
club              object
leagueCountry     object
birthday          object
height           float64
weight           float64
position          object
games              int64
victories          int64
ties               int64
defeats            int64
goals              int64
yellowCards        int64
yellowReds         int64
redCards           int64
photoID           object
rater1           float64
rater2           float64
refNum             int64
refCountry         int64
Alpha_3           object
meanIAT          float64
nIAT             float64
seIAT            float64
meanExp          float64
nExp             float64
seExp            float64
dtype: object



In [8]:

    
all_columns = df.columns.tolist()
all_columns









    Out[8]:





['playerShort',
 'player',
 'club',
 'leagueCountry',
 'birthday',
 'height',
 'weight',
 'position',
 'games',
 'victories',
 'ties',
 'defeats',
 'goals',
 'yellowCards',
 'yellowReds',
 'redCards',
 'photoID',
 'rater1',
 'rater2',
 'refNum',
 'refCountry',
 'Alpha_3',
 'meanIAT',
 'nIAT',
 'seIAT',
 'meanExp',
 'nExp',
 'seExp']

Challenge

Before looking below, try to answer some high level questions about the dataset.

How do we operationalize the question of referees giving more red cards to dark skinned players?

Counterfactual: if the player were lighter, a ref is more likely to have given a yellow or no card for the same offense under the same conditions
Regression: accounting for confounding, darker players have positive coefficient on regression against proportion red/total card

Potential issues

How to combine rater1 and rater2? Average them? What if they disagree? Throw it out?
Is data imbalanced, i.e. red cards are very rare?
Is data biased, i.e. players have different amounts of play time? Is this a summary of their whole career?
How do I know I've accounted for all forms of confounding?

First, is there systematic discrimination across all refs?

Exploration/hypotheses:

Distribution of games played
red cards vs games played
Reds per game played vs total cards per game played by skin color
Distribution of # red, # yellow, total cards, and fraction red per game played for all players by avg skin color
How many refs did players encounter?
Do some clubs play more aggresively and get carded more? Or are more reserved and get less?
Does carding vary by leagueCountry?
Do high scorers get more slack (fewer cards) for the same position?
Are there some referees that give more red/yellow cards than others?
how consistent are raters? Check with Cohen's kappa.
how do red cards vary by position? e.g. defenders get more?
Do players with more games get more cards, and is there difference across skin color?
indication of bias depending on refCountry?

Understand how the data's organized

The dataset is a single csv where it aggregated every interaction between referee and player into a single row. In other words: Referee A refereed Player B in, say, 10 games, and gave 2 redcards during those 10 games. Then there would be a unique row in the dataset that said:

Referee A, Player B, 2 redcards, ...

This has several implications that make this first step to understanding and dealing with this data a bit tricky. First, is that the information about Player B is repeated each time -- meaning if we did a simple average of some metric of we would likely get a misleading result.

For example, asking "what is the average weight of the players?"



In [9]:

    
df.height.mean()









    Out[9]:





181.93593798236887



In [10]:

    
df['height'].mean()









    Out[10]:





181.93593798236887



In [11]:

    
np.mean(df.groupby('playerShort').height.mean())









    Out[11]:





181.74372848007872

Doing a simple average over the rows will risk double-counting the same player multiple times, for a skewed average. The simple (incorrect) average is ~76.075 kg, but the average weight of the players is ~75.639 kg. There are multiple ways of doing this, but doing a groupby on player makes it so that so each player gets counted exactly once.

Not a huge difference in this case but already an illustration of some difficulty.

Tidy Data

Hadley Wickham's concept of a tidy dataset summarized as:

Each variable forms a column

Each observation forms a row

Each type of observational unit forms a table

A longer paper describing this can be found in this pdf.

Having datasets in this form allows for much simpler analyses. So the first step is to try and clean up the dataset into a tidy dataset.

The first step that I am going to take is to break up the dataset into the different observational units. By that I'm going to have separate tables (or dataframes) for:

players
clubs
referees
countries
dyads

Create Tidy Players Table



In [12]:

    
player_index = 'playerShort'
player_cols = [#'player', # drop player name, we have unique identifier
               'birthday',
               'height',
               'weight',
               'position',
               'photoID',
               'rater1',
               'rater2',
              ]



In [13]:

    
# Count the unique variables (if we got different weight values, 
# for example, then we should get more than one unique value in this groupby)
all_cols_unique_players = df.groupby('playerShort').agg({col:'nunique' for col in player_cols})



In [14]:

    
all_cols_unique_players.head()









    Out[14]:






  
    
      
      weight
      birthday
      rater1
      height
      position
      photoID
      rater2
    
    
      playerShort
      
      
      
      
      
      
      
    
  
  
    
      aaron-hughes
      1
      1
      1
      1
      1
      1
      1
    
    
      aaron-hunt
      1
      1
      1
      1
      1
      1
      1
    
    
      aaron-lennon
      1
      1
      1
      1
      1
      1
      1
    
    
      aaron-ramsey
      1
      1
      1
      1
      1
      1
      1
    
    
      abdelhamid-el-kaoutari
      1
      1
      1
      1
      1
      1
      1



In [15]:

    
# If all values are the same per player then this should be empty (and it is!)
all_cols_unique_players[all_cols_unique_players > 1].dropna().head()









    Out[15]:






  
    
      
      weight
      birthday
      rater1
      height
      position
      photoID
      rater2
    
    
      playerShort



In [16]:

    
# A slightly more elegant way to test the uniqueness
all_cols_unique_players[all_cols_unique_players > 1].dropna().shape[0] == 0









    Out[16]:





True

Hooray, our data passed our sanity check. Let's create a function to create a table and run thisÂ check for each table that we create.



In [17]:

    
def get_subgroup(dataframe, g_index, g_columns):
    """Helper function that creates a sub-table from the columns and runs a quick uniqueness test."""
    g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
    if g[g > 1].dropna().shape[0] != 0:
        print("Warning: you probably assumed this had all unique values but it doesn't.")
    return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})



In [18]:

    
players = get_subgroup(df, player_index, player_cols)
players.head()









    Out[18]:






  
    
      
      weight
      birthday
      rater1
      height
      position
      photoID
      rater2
    
    
      playerShort
      
      
      
      
      
      
      
    
  
  
    
      aaron-hughes
      71.0
      08.11.1979
      0.25
      182.0
      Center Back
      3868.jpg
      0.00
    
    
      aaron-hunt
      73.0
      04.09.1986
      0.00
      183.0
      Attacking Midfielder
      20136.jpg
      0.25
    
    
      aaron-lennon
      63.0
      16.04.1987
      0.25
      165.0
      Right Midfielder
      13515.jpg
      0.25
    
    
      aaron-ramsey
      76.0
      26.12.1990
      0.00
      178.0
      Center Midfielder
      94953.jpg
      0.00
    
    
      abdelhamid-el-kaoutari
      73.0
      17.03.1990
      0.25
      180.0
      Center Back
      124913.jpg
      0.25



In [19]:

    
def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
    save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
    # Test that we recover what we send in
    if dataframe.equals(test_df):
        print("Test-passed: we recover the equivalent subgroup dataframe.")
    else:
        print("Warning -- equivalence test!!! Double-check.")



In [20]:

    
save_subgroup(players, player_index, "players")









    



Test-passed: we recover the equivalent subgroup dataframe.

Create Tidy Clubs Table

Create the clubs table.



In [21]:

    
club_index = 'club'
club_cols = ['leagueCountry']
clubs = get_subgroup(df, club_index, club_cols)
clubs.head()









    Out[21]:






  
    
      
      leagueCountry
    
    
      club
      
    
  
  
    
      1. FC NÃ¼rnberg
      Germany
    
    
      1. FSV Mainz 05
      Germany
    
    
      1899 Hoffenheim
      Germany
    
    
      AC Ajaccio
      France
    
    
      AFC Bournemouth
      England



In [22]:

    
clubs['leagueCountry'].value_counts()









    Out[22]:





England    48
Spain      27
France     22
Germany    21
Name: leagueCountry, dtype: int64



In [23]:

    
save_subgroup(clubs, club_index, "clubs", )









    



Test-passed: we recover the equivalent subgroup dataframe.

Create Tidy Referees Table



In [24]:

    
referee_index = 'refNum'
referee_cols = ['refCountry']
referees = get_subgroup(df, referee_index, referee_cols)
referees.head()



In [25]:

    
referees.refCountry.nunique()









    Out[25]:





161



In [26]:

    
referees.tail()



In [27]:

    
referees.shape









    Out[27]:





(3147, 1)



In [28]:

    
save_subgroup(referees, referee_index, "referees")









    



Test-passed: we recover the equivalent subgroup dataframe.

Create Tidy Countries Table



In [29]:

    
country_index = 'refCountry'
country_cols = ['Alpha_3', # rename this name of country
                'meanIAT',
                'nIAT',
                'seIAT',
                'meanExp',
                'nExp',
                'seExp',
               ]
countries = get_subgroup(df, country_index, country_cols)
countries.head()



In [30]:

    
rename_columns = {'Alpha_3':'countryName', }
countries = countries.rename(columns=rename_columns)
countries.head()









    Out[30]:






  
    
      
      nIAT
      meanIAT
      countryName
      nExp
      meanExp
      seIAT
      seExp
    
    
      refCountry
      
      
      
      
      
      
      
    
  
  
    
      1
      712.0
      0.326391
      GRC
      750.0
      0.396000
      0.000564
      0.002696
    
    
      2
      40.0
      0.203375
      ZMB
      49.0
      -0.204082
      0.010875
      0.061504
    
    
      3
      1785.0
      0.369894
      ESP
      1897.0
      0.588297
      0.000229
      0.001002
    
    
      4
      127.0
      0.325185
      LUX
      130.0
      0.538462
      0.003297
      0.013752
    
    
      5
      19.0
      0.167132
      TUN
      19.0
      -0.789474
      0.027327
      0.111757



In [31]:

    
countries.shape









    Out[31]:





(161, 7)



In [32]:

    
save_subgroup(countries, country_index, "countries")









    



Warning -- equivalence test!!! Double-check.



In [33]:

    
# Ok testing this out: 
test_df = pd.read_csv("raw_countries.csv.gz", compression='gzip', index_col=country_index)



In [34]:

    
for (_, row1), (_, row2) in zip(test_df.iterrows(), countries.iterrows()):
    if not row1.equals(row2):
        print(row1)
        print()
        print(row2)
        print()
        break









    



nIAT                  127
meanIAT          0.325185
countryName           LUX
nExp                  130
meanExp          0.538462
seIAT          0.00329681
seExp           0.0137522
Name: 4, dtype: object

nIAT                  127
meanIAT          0.325185
countryName           LUX
nExp                  130
meanExp          0.538462
seIAT          0.00329681
seExp           0.0137522
Name: 4, dtype: object



In [35]:

    
row1.eq(row2)









    Out[35]:





nIAT            True
meanIAT         True
countryName     True
nExp            True
meanExp         True
seIAT          False
seExp           True
Name: 4, dtype: bool



In [36]:

    
row1.seIAT - row2.seIAT









    Out[36]:





4.336808689942018e-19



In [37]:

    
countries.dtypes









    Out[37]:





nIAT           float64
meanIAT        float64
countryName     object
nExp           float64
meanExp        float64
seIAT          float64
seExp          float64
dtype: object



In [38]:

    
test_df.dtypes









    Out[38]:





nIAT           float64
meanIAT        float64
countryName     object
nExp           float64
meanExp        float64
seIAT          float64
seExp          float64
dtype: object



In [39]:

    
countries.head()









    Out[39]:






  
    
      
      nIAT
      meanIAT
      countryName
      nExp
      meanExp
      seIAT
      seExp
    
    
      refCountry
      
      
      
      
      
      
      
    
  
  
    
      1
      712.0
      0.326391
      GRC
      750.0
      0.396000
      0.000564
      0.002696
    
    
      2
      40.0
      0.203375
      ZMB
      49.0
      -0.204082
      0.010875
      0.061504
    
    
      3
      1785.0
      0.369894
      ESP
      1897.0
      0.588297
      0.000229
      0.001002
    
    
      4
      127.0
      0.325185
      LUX
      130.0
      0.538462
      0.003297
      0.013752
    
    
      5
      19.0
      0.167132
      TUN
      19.0
      -0.789474
      0.027327
      0.111757



In [40]:

    
test_df.head()









    Out[40]:






  
    
      
      nIAT
      meanIAT
      countryName
      nExp
      meanExp
      seIAT
      seExp
    
    
      refCountry
      
      
      
      
      
      
      
    
  
  
    
      1
      712.0
      0.326391
      GRC
      750.0
      0.396000
      0.000564
      0.002696
    
    
      2
      40.0
      0.203375
      ZMB
      49.0
      -0.204082
      0.010875
      0.061504
    
    
      3
      1785.0
      0.369894
      ESP
      1897.0
      0.588297
      0.000229
      0.001002
    
    
      4
      127.0
      0.325185
      LUX
      130.0
      0.538462
      0.003297
      0.013752
    
    
      5
      19.0
      0.167132
      TUN
      19.0
      -0.789474
      0.027327
      0.111757

Looks like precision error, so I'm not concerned. All other sanity checks pass.



In [41]:

    
countries.tail()









    Out[41]:






  
    
      
      nIAT
      meanIAT
      countryName
      nExp
      meanExp
      seIAT
      seExp
    
    
      refCountry
      
      
      
      
      
      
      
    
  
  
    
      157
      27.0
      0.179002
      MOZ
      31.0
      0.000000
      0.015172
      0.032258
    
    
      158
      402.0
      0.344074
      DOM
      373.0
      0.297587
      0.001123
      0.005557
    
    
      159
      23.0
      0.318392
      CAF
      23.0
      -0.130435
      0.019218
      0.147328
    
    
      160
      90.0
      0.349965
      NPL
      101.0
      0.366337
      0.004713
      0.015700
    
    
      161
      313.0
      0.304238
      BGD
      316.0
      0.585443
      0.001560
      0.012881



In [42]:

    
test_df.tail()









    Out[42]:






  
    
      
      nIAT
      meanIAT
      countryName
      nExp
      meanExp
      seIAT
      seExp
    
    
      refCountry
      
      
      
      
      
      
      
    
  
  
    
      157
      27.0
      0.179002
      MOZ
      31.0
      0.000000
      0.015172
      0.032258
    
    
      158
      402.0
      0.344074
      DOM
      373.0
      0.297587
      0.001123
      0.005557
    
    
      159
      23.0
      0.318392
      CAF
      23.0
      -0.130435
      0.019218
      0.147328
    
    
      160
      90.0
      0.349965
      NPL
      101.0
      0.366337
      0.004713
      0.015700
    
    
      161
      313.0
      0.304238
      BGD
      316.0
      0.585443
      0.001560
      0.012881

Create separate (not yet Tidy) Dyads Table

This is one of the more complex tables to reason about -- so we'll save it for a bit later.



In [43]:

    
dyad_index = ['refNum', 'playerShort']
dyad_cols = ['games',
             'victories',
             'ties',
             'defeats',
             'goals',
             'yellowCards',
             'yellowReds',
             'redCards',
            ]



In [44]:

    
dyads = get_subgroup(df, g_index=dyad_index, g_columns=dyad_cols)



In [45]:

    
dyads.head(10)









    Out[45]:






  
    
      
      
      yellowCards
      yellowReds
      victories
      ties
      games
      defeats
      goals
      redCards
    
    
      refNum
      playerShort
      
      
      
      
      
      
      
      
    
  
  
    
      1
      lucas-wilchez
      0
      0
      0
      0
      1
      1
      0
      0
    
    
      2
      john-utaka
      1
      0
      0
      0
      1
      1
      0
      0
    
    
      3
      abdon-prats
      1
      0
      0
      1
      1
      0
      0
      0
    
    
      pablo-mari
      0
      0
      1
      0
      1
      0
      0
      0
    
    
      ruben-pena
      0
      0
      1
      0
      1
      0
      0
      0
    
    
      4
      aaron-hughes
      0
      0
      0
      0
      1
      1
      0
      0
    
    
      aleksandar-kolarov
      0
      0
      1
      0
      1
      0
      0
      0
    
    
      alexander-tettey
      0
      0
      0
      0
      1
      1
      0
      0
    
    
      anders-lindegaard
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      andreas-beck
      0
      0
      1
      0
      1
      0
      0
      0



In [46]:

    
dyads.shape









    Out[46]:





(146028, 8)



In [47]:

    
dyads[dyads.redCards > 1].head(10)









    Out[47]:






  
    
      
      
      yellowCards
      yellowReds
      victories
      ties
      games
      defeats
      goals
      redCards
    
    
      refNum
      playerShort
      
      
      
      
      
      
      
      
    
  
  
    
      140
      bodipo
      0
      0
      2
      1
      6
      3
      1
      2
    
    
      367
      antonio-lopez_2
      2
      0
      5
      2
      8
      1
      0
      2
    
    
      432
      javi-martinez
      2
      0
      4
      3
      14
      7
      2
      2
    
    
      jonas
      0
      0
      1
      4
      9
      4
      1
      2
    
    
      487
      phil-jagielka
      0
      0
      2
      1
      7
      4
      1
      2
    
    
      586
      cyril-jeunechamp
      6
      0
      8
      0
      14
      6
      0
      2
    
    
      804
      sergio-ramos
      6
      1
      12
      1
      18
      5
      4
      2
    
    
      985
      aly-cissokho
      1
      0
      1
      5
      9
      3
      1
      2
    
    
      1114
      eugen-polanski
      0
      0
      4
      0
      8
      4
      0
      2
    
    
      1214
      emmanuel-adebayor
      4
      1
      9
      7
      23
      7
      10
      2



In [48]:

    
save_subgroup(dyads, dyad_index, "dyads")









    



Test-passed: we recover the equivalent subgroup dataframe.



In [49]:

    
dyads.redCards.max()









    Out[49]:





2

	playerShort	player	club	leagueCountry	birthday	height	weight	position	games	victories	...	rater2	refNum	refCountry	Alpha_3	meanIAT	nIAT	seIAT	meanExp	nExp	seExp
0	lucas-wilchez	Lucas Wilchez	Real Zaragoza	Spain	31.08.1983	177.0	72.0	Attacking Midfielder	1	0	...	0.50	1	1	GRC	0.326391	712.0	0.000564	0.396000	750.0	0.002696
1	john-utaka	John Utaka	Montpellier HSC	France	08.01.1982	179.0	82.0	Right Winger	1	0	...	0.75	2	2	ZMB	0.203375	40.0	0.010875	-0.204082	49.0	0.061504
2	abdon-prats	AbdÃ³n Prats	RCD Mallorca	Spain	17.12.1992	181.0	79.0	NaN	1	0	...	NaN	3	3	ESP	0.369894	1785.0	0.000229	0.588297	1897.0	0.001002
3	pablo-mari	Pablo MarÃ	RCD Mallorca	Spain	31.08.1993	191.0	87.0	Center Back	1	1	...	NaN	3	3	ESP	0.369894	1785.0	0.000229	0.588297	1897.0	0.001002
4	ruben-pena	RubÃ©n PeÃ±a	Real Valladolid	Spain	18.07.1991	172.0	70.0	Right Midfielder	1	1	...	NaN	3	3	ESP	0.369894	1785.0	0.000229	0.588297	1897.0	0.001002

	count	mean	std	min	25%	50%	75%	max
height	145765.0	181.935938	6.738726	1.610000e+02	177.000000	182.000000	187.000000	2.030000e+02
weight	143785.0	76.075662	7.140906	5.400000e+01	71.000000	76.000000	81.000000	1.000000e+02
games	146028.0	2.921166	3.413633	1.000000e+00	1.000000	2.000000	3.000000	4.700000e+01
victories	146028.0	1.278344	1.790725	0.000000e+00	0.000000	1.000000	2.000000	2.900000e+01
ties	146028.0	0.708241	1.116793	0.000000e+00	0.000000	0.000000	1.000000	1.400000e+01
defeats	146028.0	0.934581	1.383059	0.000000e+00	0.000000	1.000000	1.000000	1.800000e+01
goals	146028.0	0.338058	0.906481	0.000000e+00	0.000000	0.000000	0.000000	2.300000e+01
yellowCards	146028.0	0.385364	0.795333	0.000000e+00	0.000000	0.000000	1.000000	1.400000e+01
yellowReds	146028.0	0.011381	0.107931	0.000000e+00	0.000000	0.000000	0.000000	3.000000e+00
redCards	146028.0	0.012559	0.112889	0.000000e+00	0.000000	0.000000	0.000000	2.000000e+00
rater1	124621.0	0.264255	0.295382	0.000000e+00	0.000000	0.250000	0.250000	1.000000e+00
rater2	124621.0	0.302862	0.293020	0.000000e+00	0.000000	0.250000	0.500000	1.000000e+00
refNum	146028.0	1534.827444	918.736625	1.000000e+00	641.000000	1604.000000	2345.000000	3.147000e+03
refCountry	146028.0	29.642842	27.496189	1.000000e+00	7.000000	21.000000	44.000000	1.610000e+02
meanIAT	145865.0	0.346276	0.032246	-4.725423e-02	0.334684	0.336628	0.369894	5.737933e-01
nIAT	145865.0	19697.411216	127126.197143	2.000000e+00	1785.000000	2882.000000	7749.000000	1.975803e+06
seIAT	145865.0	0.000631	0.004736	2.235373e-07	0.000055	0.000151	0.000229	2.862871e-01
meanExp	145865.0	0.452026	0.217469	-1.375000e+00	0.336101	0.356446	0.588297	1.800000e+00
nExp	145865.0	20440.233860	130615.745103	2.000000e+00	1897.000000	3011.000000	7974.000000	2.029548e+06
seExp	145865.0	0.002994	0.019723	1.043334e-06	0.000225	0.000586	0.001002	1.060660e+00

	weight	birthday	rater1	height	position	photoID	rater2
playerShort
aaron-hughes	1	1	1	1	1	1	1
aaron-hunt	1	1	1	1	1	1	1
aaron-lennon	1	1	1	1	1	1	1
aaron-ramsey	1	1	1	1	1	1	1
abdelhamid-el-kaoutari	1	1	1	1	1	1	1

	weight	birthday	rater1	height	position	photoID	rater2
playerShort
aaron-hughes	71.0	08.11.1979	0.25	182.0	Center Back	3868.jpg	0.00
aaron-hunt	73.0	04.09.1986	0.00	183.0	Attacking Midfielder	20136.jpg	0.25
aaron-lennon	63.0	16.04.1987	0.25	165.0	Right Midfielder	13515.jpg	0.25
aaron-ramsey	76.0	26.12.1990	0.00	178.0	Center Midfielder	94953.jpg	0.00
abdelhamid-el-kaoutari	73.0	17.03.1990	0.25	180.0	Center Back	124913.jpg	0.25

	leagueCountry
club
1. FC NÃ¼rnberg	Germany
1. FSV Mainz 05	Germany
1899 Hoffenheim	Germany
AC Ajaccio	France
AFC Bournemouth	England

	nIAT	meanIAT	countryName	nExp	meanExp	seIAT	seExp
refCountry
157	27.0	0.179002	MOZ	31.0	0.000000	0.015172	0.032258
158	402.0	0.344074	DOM	373.0	0.297587	0.001123	0.005557
159	23.0	0.318392	CAF	23.0	-0.130435	0.019218	0.147328
160	90.0	0.349965	NPL	101.0	0.366337	0.004713	0.015700
161	313.0	0.304238	BGD	316.0	0.585443	0.001560	0.012881

		yellowCards	yellowReds	victories	ties	games	defeats	goals	redCards
refNum	playerShort
140	bodipo	0	0	2	1	6	3	1	2
367	antonio-lopez_2	2	0	5	2	8	1	0	2
432	javi-martinez	2	0	4	3	14	7	2	2
432	jonas	0	0	1	4	9	4	1	2
487	phil-jagielka	0	0	2	1	7	4	1	2
586	cyril-jeunechamp	6	0	8	0	14	6	0	2
804	sergio-ramos	6	1	12	1	18	5	4	2
985	aly-cissokho	1	0	1	5	9	3	1	2
1114	eugen-polanski	0	0	4	0	8	4	0	2
1214	emmanuel-adebayor	4	1	9	7	23	7	10	2