Initial data visualisation from Team 23 (Tom Stafford, Mat Evans, Colin Bannard & Tim Heaton)
https://osf.io/gvm2z/
Here we present the code, results and commentary of our initial exploration of the dataset. Our final analysis (lead: Tom Stafford) was conducted in R, using the packages lme4 (lead: Tim Heaton) and MCMCglmm (lead: Colin Bannard). All code for this is available on the OSF. This analysis was dependent on the exploration presented below (lead: Mat Evans).
Enjoy our final team report here
To get started, let's import some libraries.
In [1]:
# IMPORTING LIBRARIES
print("importing libraries")
import pandas as pd # for dealing with csv import
import numpy as np # arrays and other matlab like manipulation
import os # for joining paths and filenames sensibly
import matplotlib.pyplot as plt # Matplotlib's pyplot: MATLAB-like syntax
import scipy.stats.mstats as ssm # for bootstrap
from scipy.stats import gaussian_kde as kde
import random
%matplotlib inline
import seaborn as sns # For pretty plots
# from mpld3 import display_d3
# mpld3.enable_notebook() # Uncomment these lines to use interactive plots as a default. This can lead to slow loading of the notebook
# mpld3.disable_notebook()
Now we need some data.
In [2]:
# Import original data, then do a bit of data-munging to get it in displayable form.
print("loading data from file")
filename = os.path.join('data', 'CrowdstormingDataJuly1st.csv')
df = pd.read_csv(filename)
By default the data is in a rather counter-intuitive format: referee-player dyads. What is a dyad?
In [3]:
from IPython.display import HTML
HTML('<iframe src=http://en.wikipedia.org/wiki/Dyad_(sociology) width=1000 height=350></iframe>')
Out[3]:
A referee-player dyad describes the interactions between a particular ref and one player. This means that each row in the dataset is of a unique player-ref combination, listing all of the games by a given player with a particular referee at any point in his career. Let's look at the first few rows of the dataset as an example:
In [4]:
# Display the first 10 rows of the dataset. Only the first 13 columns, for space reasons
df.iloc[:10, :13]
Out[4]:
In [5]:
# Display the other columns too
df.iloc[:10, 13:28]
Out[5]:
We already see some strange things in the data. Some refs officiated very few games. The two raters disagree about player skintone quite often. For some players there isn't a photo, so their skin tone couldn't be rated. Most dyads don't feature cards at all. In general, it's difficult to get an intuition for what the population looks like from inspecting small samples of the data. This is particularly difficult in the dyad format, as each player's or ref's cards are spread unevenly across the dataset.
Team Sheffield felt a more natural format for the data was to disaggregate it: that is, to unpack each dyad into single ref-player interactions, so that each time a player and a referee met they contribute one row to the dataset (and thus each interaction carries a maximum of one red card).
If you're interested in how the disaggregation was done, the code is in [our project folder on the OSF](https://osf.io/akqt4/) - the file is disaggregate_v3.py.
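For a flavour of what such an expansion involves, here is a minimal sketch with toy data. This is not the actual disaggregate_v3.py: the column names mirror the dataset, but the `disaggregate` helper and the toy values are our own illustration.

```python
import pandas as pd

# Toy dyad-level data: one row per player-referee pair.
# Column names mirror the dataset; values are invented for illustration.
dyads = pd.DataFrame({
    "playerShort": ["player-a", "player-b"],
    "refNum": [1, 1],
    "games": [3, 2],
    "redCards": [1, 0],
})

def disaggregate(df):
    """Expand each dyad into one row per interaction, so that each
    row carries at most one red card (stored in 'allreds')."""
    rows = []
    for _, dyad in df.iterrows():
        for g in range(int(dyad["games"])):
            row = dyad.copy()
            # distribute the dyad's red cards one per interaction
            row["allreds"] = 1 if g < dyad["redCards"] else 0
            rows.append(row)
    return pd.DataFrame(rows).reset_index(drop=True)

expanded = disaggregate(dyads)
print(len(expanded))              # 5 rows: one per game
print(expanded["allreds"].sum())  # 1 red card in total, as in the dyads
```

The real script has more columns to carry along, but the core move is the same: one dyad row with `games = n` becomes `n` interaction rows.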
In [6]:
# Load disaggregated data
print("loading disaggregated data")
filename = os.path.join('data', 'crowdstorm_disaggregated.csv')
dfd = pd.read_csv(filename)
We now have a much more 'normal' dataset, where each data point accounts for a single interaction (i.e. a game).
In [7]:
# Pull out the number of games in each dyad and plot
games = dfd.games
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
axes[0].hist(games,bins=max(games),histtype = 'stepfilled')
axes[0].set_xlabel('Number of interactions')
axes[0].set_ylabel('Frequency')
axes[1].hist(games,bins=max(games),histtype = 'stepfilled')
axes[1].set_yscale('symlog') # symmetric log scale NOTE this breaks the nice plotting tools
axes[1].set_title("Log scaled")
axes[1].set_xlabel('Number of interactions')
axes[1].set_ylabel('Log Frequency')
fig.tight_layout()
# display_d3()
print("highest number of games in a single dyad =", max(games))
Half of the dyads are unique, with the other half following an exponential decay all the way out to a high of 47 games, which is actually a pretty huge number. Who are the players featuring in these high (>=40, say) game dyads (i.e. the guys who meet the same ref 40+ times)?
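As an aside, the "half of the dyads are unique" figure can be computed directly rather than read off the histogram. A sketch with toy numbers; on the real data one would use the `games` column of the dyad-level frame `df`.

```python
import pandas as pd

# Toy dyad sizes standing in for df['games'] on the dyad-level frame
games = pd.Series([1, 1, 1, 2, 3, 47])

# fraction of dyads consisting of exactly one interaction
prop_single = (games == 1).mean()
print(prop_single)  # 0.5 for this toy series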
In [8]:
stalwarts = dfd[dfd.games>=40]
print(stalwarts.player.unique())
Now, that at least makes sense. We see that many England internationals, and two World Cup-winning German internationals (each with long careers in top-flight football), have played with the same referees many times.
By contrast, how rare are red cards?
In [9]:
print("Total interactions =", len(dfd))
clean_interactions = dfd[dfd['allreds'] == 0]
print("Number of red cards in the dataset =", len(dfd) - len(clean_interactions))
print("Number of interactions without a red card =", len(clean_interactions))
print("Proportion of interactions that are 'clean' =", len(clean_interactions) / float(len(dfd)))
In sum, red cards are very rare. We also showed that the distributions of player and ref occurrence are highly skewed. Therefore, any analysis method applied to this population needs to be able to handle these properties of the data set.
In [10]:
allRefs = dfd.refNum.value_counts()
print("Number of refs =", len(allRefs))
print("Number of dyads in the dataset =", sum(allRefs))
In [11]:
import mpld3
mpld3.enable_notebook()
# Histogram of country frequency.
fig, ax = plt.subplots(1,1,figsize=(12, 4))
x = dfd.Alpha_3.value_counts()
lines = ax.plot(x,marker='.',ms=20)
y = x.index.tolist()
tooltips = mpld3.plugins.PointLabelTooltip(lines[0], labels=y)
mpld3.plugins.connect(plt.gcf(), tooltips)
ax.set_title('Referee nationality by number of games')
ax.set_xlabel('Country number (ordered by frequency)')
ax.set_ylabel('Frequency of games')
ax.set_xlim([-3,160]) # a hack so we can see the first point most clearly
Out[11]:
In [12]:
mpld3.disable_notebook()
Most games are refereed by someone from one of a small number of countries, as we would expect - the four top-flight leagues which defined selection for our data set: England, Germany, France and Spain.
However, it also seems that around 160 different nationalities are represented among our referees. This seemed unlikely: are there really refs from almost every country on Earth in the four premier leagues in the 2012-13 season?
In [13]:
numRefs = len(dfd.refNum.value_counts())
print("Total number of referees =", numRefs)
print("Median number of dyads per referee =", np.median(dfd.refNum.value_counts()))
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
axes.hist(dfd.refNum.value_counts().tolist(),bins=numRefs)
axes.set_xscale('symlog') # symmetric log scale
axes.set_yscale('symlog')
axes.set_title("Referee occurrence, log scaled")
axes.set_xlabel('log (number of occurrences)')
axes.set_ylabel('log (frequency)')
Out[13]:
We see that though most refs are involved in only a small number of dyads, some officiated in thousands. A median of 11 indicates that more than half of the refs officiated less than one full game!
Something funny is going on: if a ref officiated a full game in one of our selected premier leagues, they would appear in at least 22 dyads (2 teams of 11 players each; more if substitutions occur).
Further analysis (not shown), including a bit of eyeballing, revealed that players' entire career histories are included in the dataset. This means that if a player was sent off in a Uruguayan league game in 2002, but then transferred to the English Premier League by 2012, that sending-off ends up in our dataset - as does the referee. This explains the high number of ref nationalities present in the data set.
In [14]:
goodRefs = allRefs[allRefs>21]
In [15]:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
axes.hist(goodRefs.tolist(),numRefs-11)
axes.set_xscale('symlog') # symmetric log scale
plt.xlim([1,10000])
axes.set_yscale('symlog')
plt.ylim([0,1000])
axes.set_title("Referee occurrence following our cull, log scaled")
axes.set_xlabel('log (number of occurrences)')
axes.set_ylabel('log (frequency)')
Out[15]:
In [16]:
print("Number of refs featuring in at least 22 dyads =", len(goodRefs))
print("Number of dyads, excluding refs who officiate fewer than 22 games =", sum(goodRefs))
We lose approximately two-thirds of the refs, but keep 97.4% of the dyads.
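The cell above prints raw counts; the percentages quoted here come from comparing the culled Series to the full one. A sketch with toy counts (on the real data, `allRefs` and `goodRefs` are the Series defined in cells 10 and 14):

```python
import pandas as pd

# Toy stand-ins for allRefs/goodRefs: dyad counts per referee
allRefs = pd.Series([5000, 300, 22, 10, 3, 1])
goodRefs = allRefs[allRefs > 21]

refs_kept = len(goodRefs) / len(allRefs)     # fraction of refs surviving the cull
dyads_kept = goodRefs.sum() / allRefs.sum()  # fraction of dyads retained
print(refs_kept, dyads_kept)
```

Because dyad counts are so skewed, dropping the many low-count refs removes very little of the data, exactly as in the real dataset.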
In [17]:
# Copying from
# http://stackoverflow.com/questions/12065885/how-to-filter-the-dataframe-rows-of-pandas-by-within-in
#
# This line defines a new dataframe based on our >21 games filter
dfd_good = dfd[dfd['refNum'].isin(goodRefs.index.values)]
# pandas is like being in an alien spaceship
# - you know you potentially control unimaginable power, but don't know what any of the buttons actually do
In [18]:
import mpld3
mpld3.enable_notebook()
# Histogram of country frequency.
fig, ax = plt.subplots(1,1,figsize=(12, 4))
x = dfd_good.Alpha_3.value_counts()
lines = ax.plot(x,marker='.',ms=20)
y = x.index.tolist()
tooltips = mpld3.plugins.PointLabelTooltip(lines[0], labels=y)
mpld3.plugins.connect(plt.gcf(), tooltips)
ax.set_title('Referee nationality by number of games')
ax.set_xlabel('Country number (ordered by frequency)')
ax.set_ylabel('Frequency of games')
ax.set_xlim([-3,160]) # a hack so we can see the first point most clearly
Out[18]:
In [19]:
mpld3.disable_notebook()
Still, 100+ nationalities are represented.
We noted that the ratings of skintone could be more reliable. The ratings diverge most at the light end of the spectrum: the two raters disagree on 28742 dyads, or 19% of the time, and the histograms of their responses show that most disagreements are between the first two categories. These first two categories account for ~70% of both raters' classifications, so biases/inconsistencies/uncertainty in this part of the dataset could have a large effect on the rest of the analysis. There could be many reasons for this, but one obvious way of dealing with it would be to use N>2 raters.
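The disagreement rate is just the proportion of items where the two ratings differ. A sketch on toy ratings (on the real data one would compare `dfd.rater1` and `dfd.rater2` after dropping the rows with no photo):

```python
import numpy as np

# Toy ratings on the five-point 0, 0.25, ..., 1 scale used by the raters
rater1 = np.array([0.0, 0.25, 0.25, 0.5, 0.75, 1.0])
rater2 = np.array([0.0, 0.0, 0.25, 0.5, 1.0, 1.0])

# proportion of items the raters disagree on
disagree = np.mean(rater1 != rater2)
print(disagree)  # 2 of the 6 toy ratings differ
```

A chance-corrected statistic such as a weighted kappa would be a natural next step, but even this raw rate shows how much of the signal depends on the raters agreeing.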
In [20]:
# Plot of skin tone rating distributions, showing skewed nature of the data w/ histograms
# and degree of disagreement between raters with a scatterplot
rated = ((dfd['rater1']+dfd['rater2'])/2).dropna()
fig, ax = plt.subplots(1,4,figsize=(12, 4))
c = sns.color_palette()
ax[0].hist(rated,bins = 5, range = (0,1),color = c[0])
ax[0].set_title("Mean rating")
ax[1].hist(dfd['rater1'].dropna().tolist(),bins = 5, range = (0,1), color = c[1])
ax[1].set_title('Rater 1')
ax[2].hist(dfd['rater2'].dropna().tolist(),bins = 5, range = (0,1),color = c[2])
ax[2].set_title('Rater 2')
ax[3].hist((dfd['rater1'] - dfd['rater2']).dropna(), bins = 5,range = (-0.5,0.5),color = c[3])
ax[3].set_title('Difference')
fig.tight_layout()
print('Mean skintone across the population =', np.mean(rated))
In [21]:
c = sns.color_palette()
jitter_x = np.random.normal(0, 0.04, size=len(dfd.rater1))
jitter_y = np.random.normal(0, 0.04, size=len(dfd.rater2))
sns.jointplot(dfd.rater1 + jitter_x, dfd.rater2 + jitter_y, kind='kde')
Out[21]:
Most of the refs come from the 4 countries where the 4 represented leagues are. It's a shame there is no data from Italy, but never mind. It's hard to know how reliable the IAT and Exp scores are with respect to the other countries, especially if there is variation in bias between refs within a country. The between-country variation in IAT and Exp scores might be smaller than the within-country variation.
There doesn't seem to be much 'signal' here. The sample sizes, though large overall, are also undermined by the small samples for refs from some countries (data not shown, but see above).
In [22]:
# Linked subplots showing these distributions, or a scatter plot with variable dot sizes showing how much overlap there is.
fig, ax = plt.subplots(2,2,figsize=(12, 8))
ax[0,0].hist(dfd.meanIAT.dropna().unique())
ax[0,0].set_title("Mean IAT - all")
ax[0,1].hist(dfd['meanExp'].dropna().unique())
ax[0,1].set_title("Mean Exp - all")
ax[1,0].hist(dfd_good.meanIAT.dropna().unique())
ax[1,0].set_title("Mean IAT - culled")
ax[1,1].hist(dfd_good['meanExp'].dropna().unique())
ax[1,1].set_title("Mean Exp - culled")
Out[22]:
In [23]:
sns.jointplot(dfd_good['meanIAT'].dropna().unique(),dfd_good['meanExp'].dropna().unique())
Out[23]:
Implicit (IAT) and explicit (Exp) attitudes correlate highly.
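A correlation coefficient would put a number on what the jointplot shows. A sketch on toy country-level scores (the values are invented; on the real data one would pass the deduplicated `meanIAT` and `meanExp` arrays from the cell above):

```python
import numpy as np

# Toy country-level scores standing in for the deduplicated meanIAT / meanExp
mean_iat = np.array([0.30, 0.34, 0.36, 0.40, 0.45])
mean_exp = np.array([0.10, 0.20, 0.25, 0.35, 0.50])

# Pearson correlation between the two attitude measures
r = np.corrcoef(mean_iat, mean_exp)[0, 1]
print(round(r, 3))  # strongly positive for these toy scores
```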
In [24]:
x = dfd_good.nIAT.dropna().unique()
plt.plot(np.sort(x),'.')
plt.xlabel('Country number (ordered by sample size)')
plt.ylabel('Sample size for IAT score')
plt.ylim([0,3000])
plt.title('Most IAT scores are based on samples of <1000')
Out[24]:
So it is perhaps not too surprising that the country attitude scores (both explicit and implicit) don't predict carding by individual referees.
At least half of the analysis effort was the initial data exploration and visualisation. Normally this is not shown directly in the reporting of a scientific project. We felt, particularly for this project, that it was worth bringing to light some of this work.
The IPython notebook allows code, commentary and results to be combined (and, if you run it in interactive mode, to be actively developed between several people).
There is a trend in behavioural science to very large datasets. When these are incidentally collected (e.g. data provided by companies or bureaucracies), or when they are collected via highly complex/opaque techniques (e.g. many kinds of imaging data) we, as scientists, are often ignorant of the exact form the data take, and the attendant peculiarities and nuances. This means that every analysis project is more and more a visualisation project as well.
Mat Evans @mathewe
Tom Stafford @tomstafford
Nov 2014