I'm naively assuming that, when trying to run as fast as possible, there is a fundamental tradeoff: stereotypical sprinters (e.g. Usain Bolt) tend to be muscular, while stereotypical long-distance runners (e.g. Hicham El Guerrouj) tend to be small.
Here we'll try to find the approximate inflexion point, i.e. the distance at which athletes switch from the former body type to the latter.
In [1]:
import numpy
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')  # stands in for the pandas option 'display.mpl_style', which no longer exists
plt.rcParams['figure.figsize'] = (15, 5)
plt.rcParams['font.family'] = 'sans-serif'
import re
The athlete stats come from The Guardian. We're only interested in Athletics here, so we drop the other sports and rewrite a few event names so that each one contains its distance in metres.
In [2]:
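df = pd.read_csv("london-2012-athletes.csv")
df = df[df.Sport == "Athletics"]
# Rewrite event names so that the digits left in each name spell out the distance in metres.
df = df.replace(to_replace={'Event': {r'0,000m': "0000m",   # "10,000m" -> "10000m"
                                      r'km': "000m",        # "20km" -> "20000m"
                                      r'Marathon': "42195m",
                                      r'4 x 100m': "100m",  # a relay counts as the distance of one leg
                                      r'4 x 400m': "400m",
                                      }
                            }, regex=True)
df = df.reset_index()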
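To see what this kind of renaming does, here is a minimal sketch on a few made-up event labels (they are illustrative, not read from the CSV). The helper names `normalise_event` and `distance_in_metres` are hypothetical, and the substitutions are written with plain `re.sub` rather than `DataFrame.replace`, so this is a variant of the idea and not the exact cell above.

```python
import re

def normalise_event(label: str) -> str:
    """Rewrite an event label so that its digits spell the distance in metres."""
    label = re.sub(r'(\d),(\d{3})', r'\1\2', label)   # drop thousands separator: 10,000m -> 10000m
    label = re.sub(r'(\d)km', r'\g<1>000m', label)    # kilometres to metres: 20km -> 20000m
    label = label.replace('Marathon', '42195m')       # a marathon is 42195 m
    label = re.sub(r'4 x (\d+m)', r'\1', label)       # relay: keep the distance of one leg
    return label

def distance_in_metres(label: str) -> int:
    """Read the distance off a label by keeping only its digits."""
    return int("".join(ch for ch in normalise_event(label) if ch.isdigit()))

# Made-up labels, for illustration only.
for label in ["Men's 10,000m", "Women's Marathon", "Men's 4 x 100m Relay"]:
    print(label, "->", normalise_event(label), "->", distance_in_metres(label))
```

Removing the thousands separator matters for the next step too, since events are later split on commas.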
Some athletes take part in multiple events; split each event into its own row.
In [3]:
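# One column per event after the split; stack() turns that into one row per (athlete, event).
s = pd.DataFrame(df.Event.str.split(',').tolist()).stack()
s.index = s.index.droplevel(-1)  # keep only the athlete's row index
s = s.str.strip()                # remove any space left after a comma
s.name = 'Event'
df = df.drop('Event', axis=1).join(s)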
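The same one-row-per-event reshaping can also be sketched with `DataFrame.explode`, assuming pandas 0.25 or newer. The toy frame below uses made-up athletes; only the column names mirror the notebook.

```python
import pandas as pd

# Toy frame with made-up athletes (illustration only).
toy = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B"],
    "Event": ["Men's 100m, Men's 200m", "Women's 5000m"],
})

# Split each cell into a list, then explode the lists into one row per element.
long = (
    toy.assign(Event=toy["Event"].str.split(","))
       .explode("Event")
       .reset_index(drop=True)
)
# Remove the whitespace left over after each comma.
long["Event"] = long["Event"].str.strip()

print(long)
```

Athlete A now appears twice, once per event, while the other columns are simply repeated.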
Drop events that aren't flat runs: throws, jumps, combined events, race walks, hurdles and the steeplechase. Relays are kept in the dataset.
In [4]:
sports_to_ignore = ['Hammer', 'Shot', 'Hurdles', 'Javelin', 'Vault', 'Decathlon', 'Heptathlon', 'Jump', 'Discus', 'Steeplechase', 'Race Walk']
ignore_pattern = '|'.join(sports_to_ignore)
df = df[~df.Event.str.contains(ignore_pattern, flags=re.IGNORECASE)]
df = df.reset_index(drop=True)
_ = df.Event.value_counts().sort_index().plot(kind="bar", title="Number of Athletes per Event")
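As a small illustration of the filtering step, the sketch below applies a case-insensitive "ignore" pattern to a made-up Series of event names. The `re.escape` call and `na=False` are additions of this sketch (they guard against special characters and missing values) and are not part of the cell above.

```python
import re
import pandas as pd

# Made-up event names, for illustration only.
toy_events = pd.Series([
    "Men's 100m",
    "Men's 110m Hurdles",
    "Women's high jump",
    "Women's 800m",
])

words_to_ignore = ['Hurdles', 'Jump']
# Join the words into a single alternation: 'Hurdles|Jump'.
pattern = '|'.join(re.escape(word) for word in words_to_ignore)

# True for every event whose name contains one of the words, whatever the case.
is_ignored = toy_events.str.contains(pattern, flags=re.IGNORECASE, na=False)
kept = toy_events[~is_ignored]

print(kept.tolist())
```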
In [5]:
# The distance in metres is whatever digits remain in the event name.
df['Distance'] = df['Event'].apply(lambda event: int("".join(filter(str.isdigit, event))))
# BMI = weight / height squared, assuming Weight is in kg; heights are given in cm.
df['BMI'] = df['Weight'] / (df['Height, cm'] / 100) ** 2
df[['Name', 'Distance', 'BMI', 'Sex']].head()
Out[5]:
In [6]:
fig, axs = plt.subplots(1, 1)
# Mean and standard deviation of BMI per distance, separately for men and women.
bmi_m = df[df.Sex == 'M'].groupby("Distance")['BMI']
bmi_f = df[df.Sex == 'F'].groupby("Distance")['BMI']
mean_m, std_m = bmi_m.mean(), bmi_m.std()
mean_f, std_f = bmi_f.mean(), bmi_f.std()
mean_m.plot(ax=axs, logx=True, title="BMI vs Distance", marker='o', yerr=std_m, label='Men')
mean_f.plot(ax=axs, logx=True, marker='o', yerr=std_f, label='Women')
axs.set_ylabel("BMI (kg/m²)")
axs.legend()
_ = axs.set_xlim(left=90, right=45000)
What can we conclude?
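One rough way to put a number on the inflexion point is to look for the pair of consecutive distances between which the mean BMI falls fastest per decade of distance (the x axis is logarithmic). The sketch below is only that, a sketch: `steepest_drop` is a hypothetical helper, and the values fed to it are made-up numbers chosen for illustration, not results from the dataset.

```python
import math

def steepest_drop(mean_by_distance):
    """Return the consecutive pair of distances between which the value falls
    fastest per decade of distance, or None if it never falls."""
    points = sorted(mean_by_distance.items())
    best_pair, best_slope = None, 0.0
    for (d0, v0), (d1, v1) in zip(points, points[1:]):
        # Slope measured against log10(distance), to match the log-scaled plot.
        slope = (v1 - v0) / (math.log10(d1) - math.log10(d0))
        if slope < best_slope:
            best_pair, best_slope = (d0, d1), slope
    return best_pair

# Made-up values, NOT taken from the dataset.
toy_means = {100: 24.0, 200: 23.8, 400: 23.0, 800: 21.0, 1500: 20.5}
print(steepest_drop(toy_means))
```

With the notebook's data, one could try `steepest_drop(mean_m.to_dict())` and `steepest_drop(mean_f.to_dict())`. Given the size of the error bars, the result should be read as indicative at best.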