Redcard Exploratory Data Analysis

This dataset is taken from a fantastic paper that examines how analytical choices made by different data science teams, all working on the same dataset and attempting to answer the same research question, affect the final outcome.

Many analysts, one dataset: Making transparent how variations in analytical choices affect results

The data can be found here.

The Task

Do an Exploratory Data Analysis on the redcard dataset, keeping in mind the following question: Are soccer referees more likely to give red cards to dark-skin-toned players than to light-skin-toned players?


In [1]:
!conda install -c conda-forge pandas-profiling -y 

!pip install missingno


Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /home/nbcommon/anaconda3_410:

The following NEW packages will be INSTALLED:

    pandas-profiling: 1.3.0-py35_0  conda-forge

The following packages will be SUPERSEDED by a higher-priority channel:

    conda:            4.3.17-py35_0             --> 4.2.13-py35_0 conda-forge
    conda-env:        2.6.0-0                   --> 2.6.0-0       conda-forge

conda-env-2.6. 100% |################################| Time: 0:00:00   1.17 MB/s
conda-4.2.13-p 100% |################################| Time: 0:00:00   1.44 MB/s
pandas-profili 100% |################################| Time: 0:00:00 663.79 kB/s
Collecting missingno
  Downloading missingno-0.3.5.tar.gz
Requirement already satisfied (use --upgrade to upgrade): numpy in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from missingno)
Requirement already satisfied (use --upgrade to upgrade): matplotlib in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from missingno)
Requirement already satisfied (use --upgrade to upgrade): scipy in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from missingno)
Requirement already satisfied (use --upgrade to upgrade): seaborn in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from missingno)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from matplotlib->missingno)
Requirement already satisfied (use --upgrade to upgrade): pytz in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from matplotlib->missingno)
Requirement already satisfied (use --upgrade to upgrade): cycler in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from matplotlib->missingno)
Requirement already satisfied (use --upgrade to upgrade): pyparsing!=2.0.4,>=1.5.6 in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from matplotlib->missingno)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in /home/nbcommon/anaconda3_410/lib/python3.5/site-packages (from python-dateutil->matplotlib->missingno)
Building wheels for collected packages: missingno
  Running setup.py bdist_wheel for missingno ... - \ done
  Stored in directory: /home/nbuser/.cache/pip/wheels/f3/9f/31/38d2fad2bd1ac3ac70a2d159c61515de5825296429a4f13056
Successfully built missingno
Installing collected packages: missingno
Successfully installed missingno-0.3.5
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

In [2]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)

import missingno as msno
import pandas_profiling

from sklearn.datasets import make_blobs
import time


/home/nbuser/anaconda3_410/lib/python3.5/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

About the Data

The dataset is available as a list of 146,028 player–referee dyads and includes details about players, details about referees, and details regarding player–referee interactions. A summary of the variables of interest can be seen below. A detailed description of all variables included can be found in the README file on the project website.

From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2,053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3,147) that these players played under in their professional career (see Figure 1). We created a dataset of player–referee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.

-- https://docs.google.com/document/d/1uCF5wmbcL90qvrk_J27fWAvDcDNrO9o_APkicwRkOKc/edit

Variable Name: Variable Description:
playerShort short player ID
player player name
club player club
leagueCountry country of player club (England, Germany, France, and Spain)
height player height (in cm)
weight player weight (in kg)
position player position
games number of games in the player-referee dyad
goals number of goals in the player-referee dyad
yellowCards number of yellow cards player received from the referee
yellowReds number of yellow-red cards player received from the referee
redCards number of red cards player received from the referee
photoID ID of player photo (if available)
rater1 skin rating of photo by rater 1
rater2 skin rating of photo by rater 2
refNum unique referee ID number (referee name removed for anonymizing purposes)
refCountry unique referee country ID number
meanIAT mean implicit bias score (using the race IAT) for referee country
nIAT sample size for race IAT in that particular country
seIAT standard error for mean estimate of race IAT
meanExp mean explicit bias score (using a racial thermometer task) for referee country
nExp sample size for explicit bias in that particular country
seExp standard error for mean estimate of explicit bias measure

What the teams found

Choices in model features

The following are the covariates chosen for the respective models:

Choices in modeling

Of the many choices made by the teams, here is a small selection of the models used to answer this question:

Final Results

  • 0 teams: negative effect
  • 9 teams: no significant relationship
  • 20 teams: positive effect

Above image from: http://fivethirtyeight.com/features/science-isnt-broken/#part2

…selecting randomly from the present teams, there would have been a 69% probability of reporting a positive result and a 31% probability of reporting a null effect. This raises the possibility that many research projects contain hidden uncertainty due to the wide range of analytic choices available to the researchers. -- Silberzahn, R., Uhlmann, E. L., Martin, D. P., Pasquale, Aust, F., Awtrey, E. C., … Nosek, B. A. (2015, August 20). Many analysts, one dataset: Making transparent how variations in analytical choices affect results. Retrieved from osf.io/gvm2z

Images and data from: Silberzahn, R., Uhlmann, E. L., Martin, D. P., Pasquale, Aust, F., Awtrey, E. C., … Nosek, B. A. (2015, August 20). Many analysts, one dataset: Making transparent how variations in analytical choices affect results. Retrieved from osf.io/gvm2z

General tips

  • Before plotting/joining/doing something, have a question or hypothesis that you want to investigate
  • Draw a plot of what you want to see on paper to sketch the idea
  • Write it down, then make a plan for how to get there
  • How do you know you aren't fooling yourself?
  • What else can you check if this is actually true?
  • What evidence could there be that it's wrong?

In [3]:
# Uncomment one of the following lines and run the cell:

# df = pd.read_csv("../data/redcard/redcard.csv.gz", compression='gzip')
df = pd.read_csv("https://github.com/cmawer/pycon-2017-eda-tutorial/raw/master/data/redcard/redcard.csv.gz",
                 compression='gzip')

In [4]:
df.shape


Out[4]:
(146028, 28)

In [5]:
df.head()


Out[5]:
playerShort player club leagueCountry birthday height weight position games victories ... rater2 refNum refCountry Alpha_3 meanIAT nIAT seIAT meanExp nExp seExp
0 lucas-wilchez Lucas Wilchez Real Zaragoza Spain 31.08.1983 177.0 72.0 Attacking Midfielder 1 0 ... 0.50 1 1 GRC 0.326391 712.0 0.000564 0.396000 750.0 0.002696
1 john-utaka John Utaka Montpellier HSC France 08.01.1982 179.0 82.0 Right Winger 1 0 ... 0.75 2 2 ZMB 0.203375 40.0 0.010875 -0.204082 49.0 0.061504
2 abdon-prats Abdón Prats RCD Mallorca Spain 17.12.1992 181.0 79.0 NaN 1 0 ... NaN 3 3 ESP 0.369894 1785.0 0.000229 0.588297 1897.0 0.001002
3 pablo-mari Pablo Marí RCD Mallorca Spain 31.08.1993 191.0 87.0 Center Back 1 1 ... NaN 3 3 ESP 0.369894 1785.0 0.000229 0.588297 1897.0 0.001002
4 ruben-pena Rubén Peña Real Valladolid Spain 18.07.1991 172.0 70.0 Right Midfielder 1 1 ... NaN 3 3 ESP 0.369894 1785.0 0.000229 0.588297 1897.0 0.001002

5 rows × 28 columns


In [6]:
df.describe().T


Out[6]:
count mean std min 25% 50% 75% max
height 145765.0 181.935938 6.738726 1.610000e+02 177.000000 182.000000 187.000000 2.030000e+02
weight 143785.0 76.075662 7.140906 5.400000e+01 71.000000 76.000000 81.000000 1.000000e+02
games 146028.0 2.921166 3.413633 1.000000e+00 1.000000 2.000000 3.000000 4.700000e+01
victories 146028.0 1.278344 1.790725 0.000000e+00 0.000000 1.000000 2.000000 2.900000e+01
ties 146028.0 0.708241 1.116793 0.000000e+00 0.000000 0.000000 1.000000 1.400000e+01
defeats 146028.0 0.934581 1.383059 0.000000e+00 0.000000 1.000000 1.000000 1.800000e+01
goals 146028.0 0.338058 0.906481 0.000000e+00 0.000000 0.000000 0.000000 2.300000e+01
yellowCards 146028.0 0.385364 0.795333 0.000000e+00 0.000000 0.000000 1.000000 1.400000e+01
yellowReds 146028.0 0.011381 0.107931 0.000000e+00 0.000000 0.000000 0.000000 3.000000e+00
redCards 146028.0 0.012559 0.112889 0.000000e+00 0.000000 0.000000 0.000000 2.000000e+00
rater1 124621.0 0.264255 0.295382 0.000000e+00 0.000000 0.250000 0.250000 1.000000e+00
rater2 124621.0 0.302862 0.293020 0.000000e+00 0.000000 0.250000 0.500000 1.000000e+00
refNum 146028.0 1534.827444 918.736625 1.000000e+00 641.000000 1604.000000 2345.000000 3.147000e+03
refCountry 146028.0 29.642842 27.496189 1.000000e+00 7.000000 21.000000 44.000000 1.610000e+02
meanIAT 145865.0 0.346276 0.032246 -4.725423e-02 0.334684 0.336628 0.369894 5.737933e-01
nIAT 145865.0 19697.411216 127126.197143 2.000000e+00 1785.000000 2882.000000 7749.000000 1.975803e+06
seIAT 145865.0 0.000631 0.004736 2.235373e-07 0.000055 0.000151 0.000229 2.862871e-01
meanExp 145865.0 0.452026 0.217469 -1.375000e+00 0.336101 0.356446 0.588297 1.800000e+00
nExp 145865.0 20440.233860 130615.745103 2.000000e+00 1897.000000 3011.000000 7974.000000 2.029548e+06
seExp 145865.0 0.002994 0.019723 1.043334e-06 0.000225 0.000586 0.001002 1.060660e+00

In [7]:
df.dtypes


Out[7]:
playerShort       object
player            object
club              object
leagueCountry     object
birthday          object
height           float64
weight           float64
position          object
games              int64
victories          int64
ties               int64
defeats            int64
goals              int64
yellowCards        int64
yellowReds         int64
redCards           int64
photoID           object
rater1           float64
rater2           float64
refNum             int64
refCountry         int64
Alpha_3           object
meanIAT          float64
nIAT             float64
seIAT            float64
meanExp          float64
nExp             float64
seExp            float64
dtype: object

In [8]:
all_columns = df.columns.tolist()
all_columns


Out[8]:
['playerShort',
 'player',
 'club',
 'leagueCountry',
 'birthday',
 'height',
 'weight',
 'position',
 'games',
 'victories',
 'ties',
 'defeats',
 'goals',
 'yellowCards',
 'yellowReds',
 'redCards',
 'photoID',
 'rater1',
 'rater2',
 'refNum',
 'refCountry',
 'Alpha_3',
 'meanIAT',
 'nIAT',
 'seIAT',
 'meanExp',
 'nExp',
 'seExp']

Challenge

Before looking below, try to answer some high level questions about the dataset.

How do we operationalize the question of referees giving more red cards to dark-skin-toned players?

  • Counterfactual: if the player were lighter, a ref would have been more likely to give a yellow card or no card for the same offense under the same conditions
  • Regression: accounting for confounding, darker players have a positive coefficient in a regression against the proportion of red cards to total cards
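The regression framing can be sketched with plain pandas before any modeling. The toy frame below is entirely made up; it only mirrors the dataset's column names, and `skintone` is a combined-rater column I'm introducing for illustration:

```python
import pandas as pd

# Hypothetical mini-version of the dyads (made-up values, dataset-style names)
toy = pd.DataFrame({'skintone': [0.0, 0.0, 0.5, 1.0, 1.0, 1.0],
                    'redCards': [0,   1,   0,   1,   0,   1],
                    'games':    [10,  8,   5,   6,   4,   9]})

# Crude first cut at the outcome: red cards per game, by skin-tone group
agg = toy.groupby('skintone')[['redCards', 'games']].sum()
agg['red_per_game'] = agg['redCards'] / agg['games']
print(agg['red_per_game'])
```

A real analysis would model at the dyad level with confounders (position, league, referee country), which is exactly where the many teams' analytic choices diverged.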

Potential issues

  • How do we combine rater1 and rater2? Average them? What if they disagree -- throw those rows out?
  • Is the data imbalanced, i.e., are red cards very rare?
  • Is the data biased, i.e., do players have different amounts of play time? Is this a summary of their whole career?
  • How do I know I've accounted for all forms of confounding?
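One hedged sketch of the first issue -- averaging the two raters and flagging disagreement. The values below are made up, and `skintone`/`disagree` are names I'm introducing, not columns in the dataset:

```python
import numpy as np
import pandas as pd

# Made-up ratings on the same 0-1 scale used by rater1/rater2
ratings = pd.DataFrame({'rater1': [0.25, 0.50, np.nan],
                        'rater2': [0.25, 0.75, 0.25]})

# Average the raters; mean(axis=1) skips NaN, so a lone rating is used as-is
ratings['skintone'] = ratings[['rater1', 'rater2']].mean(axis=1)

# Flag rows where the raters disagree, to inspect or drop later
ratings['disagree'] = (ratings['rater1'] - ratings['rater2']).abs() > 0
print(ratings)
```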

First, is there systematic discrimination across all refs?

Exploration/hypotheses:

  • Distribution of games played
  • Red cards vs. games played
  • Reds per game played vs. total cards per game played, by skin color
  • Distribution of # red, # yellow, total cards, and fraction red per game played for all players, by average skin color
  • How many refs did players encounter?
  • Do some clubs play more aggressively and get carded more? Or are some more reserved and get carded less?
  • Does carding vary by leagueCountry?
  • Do high scorers get more slack (fewer cards) for the same position?
  • Are there some referees that give more red/yellow cards than others?
  • How consistent are the raters? Check with Cohen's kappa.
  • How do red cards vary by position? E.g., do defenders get more?
  • Do players with more games get more cards, and is there a difference across skin color?
  • Is there any indication of bias depending on refCountry?
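The Cohen's kappa check can be sketched with scikit-learn (already imported above for make_blobs). The rating vectors here are invented, and the 0-1 scale is mapped to integer categories because cohen_kappa_score expects discrete labels:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Invented ratings on the dataset's five-point 0-1 scale
r1 = np.array([0.0, 0.25, 0.25, 0.5, 0.75, 1.0])
r2 = np.array([0.0, 0.25, 0.5,  0.5, 0.75, 1.0])

# Map the 0-1 scale to integer categories {0, ..., 4} before computing kappa
kappa = cohen_kappa_score((r1 * 4).astype(int), (r2 * 4).astype(int))
print(round(kappa, 3))
```

Kappa near 1 would mean the raters are nearly interchangeable; values much lower would make the "average the raters" shortcut above harder to defend.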

Understand how the data's organized

The dataset is a single CSV in which every interaction between a referee and a player is aggregated into a single row. In other words: if Referee A refereed Player B in, say, 10 games and gave 2 red cards during those 10 games, then there would be a unique row in the dataset that said:

Referee A, Player B, 2 redcards, ... 

This has several implications that make this first step of understanding and dealing with the data a bit tricky. First, the information about Player B is repeated for every referee he encountered -- meaning that if we took a simple average of some metric, we would likely get a misleading result.

For example, asking "what is the average height of the players?"


In [9]:
df.height.mean()


Out[9]:
181.93593798236887

In [10]:
df['height'].mean()


Out[10]:
181.93593798236887

In [11]:
np.mean(df.groupby('playerShort').height.mean())


Out[11]:
181.74372848007872

Taking a simple average over the rows risks counting the same player multiple times, skewing the result. The simple (incorrect) average height is ~181.94 cm, while the per-player average is ~181.74 cm. There are multiple ways to handle this, but a groupby on player ensures that each player is counted exactly once.

Not a huge difference in this case, but already an illustration of the kind of subtlety this structure introduces.
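The same effect in miniature, with made-up numbers:

```python
import pandas as pd

# Player 'a' appears in three dyads, player 'b' in one
toy = pd.DataFrame({'playerShort': ['a', 'a', 'a', 'b'],
                    'height': [180.0, 180.0, 180.0, 170.0]})

row_mean = toy.height.mean()                                   # 'a' counted 3 times
player_mean = toy.groupby('playerShort').height.mean().mean()  # each player once
print(row_mean, player_mean)
```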

Tidy Data

Hadley Wickham's concept of a tidy dataset can be summarized as:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

A longer paper describing this can be found in this pdf.
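As a minimal illustration of the first two rules (the columns below are made up, not from the redcard data): a table with a variable's values hiding in the column headers can be melted so each observation forms its own row:

```python
import pandas as pd

# Untidy: the 'season' variable is spread across the column headers
untidy = pd.DataFrame({'club': ['A', 'B'],
                       '2011-12': [3, 1],
                       '2012-13': [2, 4]})

# Tidy: melt so that season becomes a column and each row is one observation
tidy = untidy.melt(id_vars='club', var_name='season', value_name='redCards')
print(tidy)
```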

Having datasets in this form allows for much simpler analyses, so the first step is to try to clean this dataset up into tidy datasets.

To do that, I'm going to break the dataset up into its different observational units, with a separate table (or dataframe) for each:

  • players
  • clubs
  • referees
  • countries
  • dyads

Create Tidy Players Table


In [12]:
player_index = 'playerShort'
player_cols = [#'player', # drop player name, we have unique identifier
               'birthday',
               'height',
               'weight',
               'position',
               'photoID',
               'rater1',
               'rater2',
              ]

In [13]:
# Count the unique variables (if we got different weight values, 
# for example, then we should get more than one unique value in this groupby)
all_cols_unique_players = df.groupby('playerShort').agg({col:'nunique' for col in player_cols})

In [14]:
all_cols_unique_players.head()


Out[14]:
weight birthday rater1 height position photoID rater2
playerShort
aaron-hughes 1 1 1 1 1 1 1
aaron-hunt 1 1 1 1 1 1 1
aaron-lennon 1 1 1 1 1 1 1
aaron-ramsey 1 1 1 1 1 1 1
abdelhamid-el-kaoutari 1 1 1 1 1 1 1

In [15]:
# If all values are the same per player then this should be empty (and it is!)
all_cols_unique_players[all_cols_unique_players > 1].dropna().head()


Out[15]:
weight birthday rater1 height position photoID rater2
playerShort

In [16]:
# A slightly more elegant way to test the uniqueness
all_cols_unique_players[all_cols_unique_players > 1].dropna().shape[0] == 0


Out[16]:
True

Hooray, our data passed the sanity check. Let's write a function that builds such a sub-table and runs this uniqueness check, and use it for each table we create.


In [17]:
def get_subgroup(dataframe, g_index, g_columns):
    """Helper function that creates a sub-table from the columns and runs a quick uniqueness test."""
    g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
    if g[g > 1].dropna().shape[0] != 0:
        print("Warning: you probably assumed this had all unique values but it doesn't.")
    return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})

In [18]:
players = get_subgroup(df, player_index, player_cols)
players.head()


Out[18]:
weight birthday rater1 height position photoID rater2
playerShort
aaron-hughes 71.0 08.11.1979 0.25 182.0 Center Back 3868.jpg 0.00
aaron-hunt 73.0 04.09.1986 0.00 183.0 Attacking Midfielder 20136.jpg 0.25
aaron-lennon 63.0 16.04.1987 0.25 165.0 Right Midfielder 13515.jpg 0.25
aaron-ramsey 76.0 26.12.1990 0.00 178.0 Center Midfielder 94953.jpg 0.00
abdelhamid-el-kaoutari 73.0 17.03.1990 0.25 180.0 Center Back 124913.jpg 0.25

In [19]:
def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
    save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
    # Test that we recover what we send in
    if dataframe.equals(test_df):
        print("Test-passed: we recover the equivalent subgroup dataframe.")
    else:
        print("Warning -- equivalence test!!! Double-check.")

In [20]:
save_subgroup(players, player_index, "players")


Test-passed: we recover the equivalent subgroup dataframe.

Create Tidy Clubs Table

Create the clubs table.


In [21]:
club_index = 'club'
club_cols = ['leagueCountry']
clubs = get_subgroup(df, club_index, club_cols)
clubs.head()


Out[21]:
leagueCountry
club
1. FC Nürnberg Germany
1. FSV Mainz 05 Germany
1899 Hoffenheim Germany
AC Ajaccio France
AFC Bournemouth England

In [22]:
clubs['leagueCountry'].value_counts()


Out[22]:
England    48
Spain      27
France     22
Germany    21
Name: leagueCountry, dtype: int64

In [23]:
save_subgroup(clubs, club_index, "clubs", )


Test-passed: we recover the equivalent subgroup dataframe.

Create Tidy Referees Table


In [24]:
referee_index = 'refNum'
referee_cols = ['refCountry']
referees = get_subgroup(df, referee_index, referee_cols)
referees.head()


Out[24]:
refCountry
refNum
1 1
2 2
3 3
4 4
5 5

In [25]:
referees.refCountry.nunique()


Out[25]:
161

In [26]:
referees.tail()


Out[26]:
refCountry
refNum
3143 51
3144 55
3145 21
3146 51
3147 21

In [27]:
referees.shape


Out[27]:
(3147, 1)

In [28]:
save_subgroup(referees, referee_index, "referees")


Test-passed: we recover the equivalent subgroup dataframe.

Create Tidy Countries Table


In [29]:
country_index = 'refCountry'
country_cols = ['Alpha_3', # renamed to countryName below
                'meanIAT',
                'nIAT',
                'seIAT',
                'meanExp',
                'nExp',
                'seExp',
               ]
countries = get_subgroup(df, country_index, country_cols)
countries.head()


Out[29]:
nIAT meanIAT Alpha_3 nExp meanExp seIAT seExp
refCountry
1 712.0 0.326391 GRC 750.0 0.396000 0.000564 0.002696
2 40.0 0.203375 ZMB 49.0 -0.204082 0.010875 0.061504
3 1785.0 0.369894 ESP 1897.0 0.588297 0.000229 0.001002
4 127.0 0.325185 LUX 130.0 0.538462 0.003297 0.013752
5 19.0 0.167132 TUN 19.0 -0.789474 0.027327 0.111757

In [30]:
rename_columns = {'Alpha_3':'countryName', }
countries = countries.rename(columns=rename_columns)
countries.head()


Out[30]:
nIAT meanIAT countryName nExp meanExp seIAT seExp
refCountry
1 712.0 0.326391 GRC 750.0 0.396000 0.000564 0.002696
2 40.0 0.203375 ZMB 49.0 -0.204082 0.010875 0.061504
3 1785.0 0.369894 ESP 1897.0 0.588297 0.000229 0.001002
4 127.0 0.325185 LUX 130.0 0.538462 0.003297 0.013752
5 19.0 0.167132 TUN 19.0 -0.789474 0.027327 0.111757

In [31]:
countries.shape


Out[31]:
(161, 7)

In [32]:
save_subgroup(countries, country_index, "countries")


Warning -- equivalence test!!! Double-check.

In [33]:
# Ok testing this out: 
test_df = pd.read_csv("raw_countries.csv.gz", compression='gzip', index_col=country_index)

In [34]:
for (_, row1), (_, row2) in zip(test_df.iterrows(), countries.iterrows()):
    if not row1.equals(row2):
        print(row1)
        print()
        print(row2)
        print()
        break


nIAT                  127
meanIAT          0.325185
countryName           LUX
nExp                  130
meanExp          0.538462
seIAT          0.00329681
seExp           0.0137522
Name: 4, dtype: object

nIAT                  127
meanIAT          0.325185
countryName           LUX
nExp                  130
meanExp          0.538462
seIAT          0.00329681
seExp           0.0137522
Name: 4, dtype: object


In [35]:
row1.eq(row2)


Out[35]:
nIAT            True
meanIAT         True
countryName     True
nExp            True
meanExp         True
seIAT          False
seExp           True
Name: 4, dtype: bool

In [36]:
row1.seIAT - row2.seIAT


Out[36]:
4.336808689942018e-19

In [37]:
countries.dtypes


Out[37]:
nIAT           float64
meanIAT        float64
countryName     object
nExp           float64
meanExp        float64
seIAT          float64
seExp          float64
dtype: object

In [38]:
test_df.dtypes


Out[38]:
nIAT           float64
meanIAT        float64
countryName     object
nExp           float64
meanExp        float64
seIAT          float64
seExp          float64
dtype: object

In [39]:
countries.head()


Out[39]:
nIAT meanIAT countryName nExp meanExp seIAT seExp
refCountry
1 712.0 0.326391 GRC 750.0 0.396000 0.000564 0.002696
2 40.0 0.203375 ZMB 49.0 -0.204082 0.010875 0.061504
3 1785.0 0.369894 ESP 1897.0 0.588297 0.000229 0.001002
4 127.0 0.325185 LUX 130.0 0.538462 0.003297 0.013752
5 19.0 0.167132 TUN 19.0 -0.789474 0.027327 0.111757

In [40]:
test_df.head()


Out[40]:
nIAT meanIAT countryName nExp meanExp seIAT seExp
refCountry
1 712.0 0.326391 GRC 750.0 0.396000 0.000564 0.002696
2 40.0 0.203375 ZMB 49.0 -0.204082 0.010875 0.061504
3 1785.0 0.369894 ESP 1897.0 0.588297 0.000229 0.001002
4 127.0 0.325185 LUX 130.0 0.538462 0.003297 0.013752
5 19.0 0.167132 TUN 19.0 -0.789474 0.027327 0.111757

Looks like a floating-point precision error, so I'm not concerned. All other sanity checks pass.
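For round-trip checks like this, a tolerance-aware comparison avoids false alarms from sub-epsilon differences. A sketch, with values made up to mimic the seIAT discrepancy above:

```python
import pandas as pd

# Two frames differing only by floating-point noise far below any real tolerance
a = pd.DataFrame({'seIAT': [0.00329681]})
b = pd.DataFrame({'seIAT': [0.00329681 + 1e-12]})

print(a.equals(b))  # exact comparison flags a difference

# check_exact=False compares within a relative tolerance instead of exactly
pd.testing.assert_frame_equal(a, b, check_exact=False)
print("equivalent within tolerance")
```

Swapping the `.equals()` test inside `save_subgroup` for such a tolerant comparison would make the countries round-trip pass.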


In [41]:
countries.tail()


Out[41]:
nIAT meanIAT countryName nExp meanExp seIAT seExp
refCountry
157 27.0 0.179002 MOZ 31.0 0.000000 0.015172 0.032258
158 402.0 0.344074 DOM 373.0 0.297587 0.001123 0.005557
159 23.0 0.318392 CAF 23.0 -0.130435 0.019218 0.147328
160 90.0 0.349965 NPL 101.0 0.366337 0.004713 0.015700
161 313.0 0.304238 BGD 316.0 0.585443 0.001560 0.012881

In [42]:
test_df.tail()


Out[42]:
nIAT meanIAT countryName nExp meanExp seIAT seExp
refCountry
157 27.0 0.179002 MOZ 31.0 0.000000 0.015172 0.032258
158 402.0 0.344074 DOM 373.0 0.297587 0.001123 0.005557
159 23.0 0.318392 CAF 23.0 -0.130435 0.019218 0.147328
160 90.0 0.349965 NPL 101.0 0.366337 0.004713 0.015700
161 313.0 0.304238 BGD 316.0 0.585443 0.001560 0.012881

Create separate (not yet Tidy) Dyads Table

This is one of the more complex tables to reason about -- so we'll save it for a bit later.


In [43]:
dyad_index = ['refNum', 'playerShort']
dyad_cols = ['games',
             'victories',
             'ties',
             'defeats',
             'goals',
             'yellowCards',
             'yellowReds',
             'redCards',
            ]

In [44]:
dyads = get_subgroup(df, g_index=dyad_index, g_columns=dyad_cols)

In [45]:
dyads.head(10)


Out[45]:
yellowCards yellowReds victories ties games defeats goals redCards
refNum playerShort
1 lucas-wilchez 0 0 0 0 1 1 0 0
2 john-utaka 1 0 0 0 1 1 0 0
3 abdon-prats 1 0 0 1 1 0 0 0
pablo-mari 0 0 1 0 1 0 0 0
ruben-pena 0 0 1 0 1 0 0 0
4 aaron-hughes 0 0 0 0 1 1 0 0
aleksandar-kolarov 0 0 1 0 1 0 0 0
alexander-tettey 0 0 0 0 1 1 0 0
anders-lindegaard 0 0 0 1 1 0 0 0
andreas-beck 0 0 1 0 1 0 0 0

In [46]:
dyads.shape


Out[46]:
(146028, 8)

In [47]:
dyads[dyads.redCards > 1].head(10)


Out[47]:
yellowCards yellowReds victories ties games defeats goals redCards
refNum playerShort
140 bodipo 0 0 2 1 6 3 1 2
367 antonio-lopez_2 2 0 5 2 8 1 0 2
432 javi-martinez 2 0 4 3 14 7 2 2
jonas 0 0 1 4 9 4 1 2
487 phil-jagielka 0 0 2 1 7 4 1 2
586 cyril-jeunechamp 6 0 8 0 14 6 0 2
804 sergio-ramos 6 1 12 1 18 5 4 2
985 aly-cissokho 1 0 1 5 9 3 1 2
1114 eugen-polanski 0 0 4 0 8 4 0 2
1214 emmanuel-adebayor 4 1 9 7 23 7 10 2

In [48]:
save_subgroup(dyads, dyad_index, "dyads")


Test-passed: we recover the equivalent subgroup dataframe.

In [49]:
dyads.redCards.max()


Out[49]:
2