Today, we'll go with the first option: download the data from http://nflsavant.com as CSV using wget, then load the local file with pandas' read_csv. read_csv can also read CSV data directly from a URL, but working from a local copy means we don't download the file again every time we load the data frame, something I'm sure the owner of the website will appreciate.
In [1]:
!wget "http://nflsavant.com/pbp_data.php?year=2015" -O pbp-2015.csv
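If you'd rather stay in Python than shell out to wget, the same "download once, then read locally" idea can be sketched as below. The helper name `fetch_once` and the constants are my own, made up for illustration; only the URL and file name come from the wget command above.

```python
import os
import urllib.request

# Assumed names: these constants simply mirror the wget command above.
DATA_URL = "http://nflsavant.com/pbp_data.php?year=2015"
LOCAL_CSV = "pbp-2015.csv"


def fetch_once(url, path, download=urllib.request.urlretrieve):
    """Download url to path unless path already exists.

    Returns True if a download happened, False if the local copy was reused.
    The download argument is injectable so the helper can be tried offline.
    """
    if os.path.exists(path):
        return False
    download(url, path)
    return True

# Typical use (not run here, since it hits the network):
#   fetch_once(DATA_URL, LOCAL_CSV)
#   df = pd.read_csv(LOCAL_CSV)
```

Running the cell a second time is then a no-op as far as the website is concerned.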
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_context("talk")
plt.figure(figsize=(10, 8))
Out[1]:
In [2]:
df = pd.read_csv('pbp-2015.csv')
In [3]:
# What do we have?
df.columns
Out[3]:
In [4]:
def event_to_datetime(row):
    """Calculate a datetime from date, quarter, minute and second of an event."""
    mins = 15 * (row['Quarter'] - 1) + row['Minute']
    hours, mins = divmod(mins, 60)
    return "{} {}:{:02}:{:02}".format(row['GameDate'], hours, mins, row['Second'])
In [5]:
df['datetime'] = pd.to_datetime(df.apply(event_to_datetime, axis=1))
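To see what the helper actually produces, here is a small worked example on plain dictionaries. The function body is the same as in the notebook; the two rows are hypothetical values I made up, not rows from the data set.

```python
def event_to_datetime(row):
    """Same logic as the notebook helper; works on any mapping with these keys."""
    mins = 15 * (row['Quarter'] - 1) + row['Minute']
    hours, mins = divmod(mins, 60)
    return "{} {}:{:02}:{:02}".format(row['GameDate'], hours, mins, row['Second'])


# Hypothetical rows, for illustration only.
first_quarter = {'GameDate': '2015-09-13', 'Quarter': 1, 'Minute': 7, 'Second': 5}
overtime = {'GameDate': '2015-09-13', 'Quarter': 5, 'Minute': 3, 'Second': 40}

print(event_to_datetime(first_quarter))  # 2015-09-13 0:07:05
print(event_to_datetime(overtime))       # 2015-09-13 1:03:40
```

Each quarter shifts the minute count by 15, so a fifth period rolls over into the second "hour". The result is a synthetic timestamp that is only useful for placing events on an axis, not a real wall-clock time.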
In [6]:
car = df[(df.OffenseTeam=='CAR')]
The pandas plot does a decent job, but it doesn't know about categoricals, so we can't use the game date string for the x axis. Instead we use the datetime we calculated from the quarter, minute and second. Now pandas treats the data as a time series, and it does look like one. Think of it instead as a kind of parallel plot, and ignore the sloped lines connecting one date to the next, since they don't mean anything here (pandas also has an actual parallel coordinates plot).
In [7]:
ax = car.plot(x='datetime', y='Yards')
Not bad, but not completely helpful. Sure, pandas also has bar plots, but I think something else could work better visually. Let's see what Seaborn has to offer. How about a strip plot? It is a scatter plot for categorical data. We'll add jitter on the x axis so that overlapping points are easier to see.
In [8]:
g = sns.stripplot(x='GameDate', y='Yards', data=car, jitter=True)
for item in g.get_xticklabels(): item.set_rotation(60)
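If you're wondering what jitter does, the idea can be sketched in a few lines: each point keeps its category slot on the x axis but gets a small random horizontal offset, so points with the same value no longer sit on top of each other. This is only an illustration of the idea, not Seaborn's implementation; the function name and the width of 0.1 are arbitrary choices of mine.

```python
import random


def jitter_positions(category_index, n_points, width=0.1, seed=0):
    """Return n_points x positions spread uniformly within +/- width of a category slot."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [category_index + rng.uniform(-width, width) for _ in range(n_points)]


# Five plays from the game drawn at x slot 2 (hypothetical).
xs = jitter_positions(category_index=2, n_points=5)
print(xs)
```

Only the categorical axis is disturbed; the Yards values on the other axis stay exactly as they are in the data.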
In [9]:
# We can also alter the look of the strip plot significantly
g = sns.stripplot(x='Yards', y='GameDate', data=car,
                  palette="Set2", size=6, marker="D", edgecolor="gray", alpha=.25)
In [10]:
car_atl = df[(df.OffenseTeam=='CAR')|(df.OffenseTeam=='ATL')]
Colors can really improve readability. The Atlanta Falcons' primary color is red and the Carolina Panthers' primary color is light blue. Let's use those, with color_palette as a context manager:
In [11]:
with sns.color_palette([sns.color_palette("muted")[2], sns.color_palette("muted")[5]]):
    g = sns.stripplot(x='Quarter', y='Yards', data=car_atl, hue='OffenseTeam', jitter=True)
    g.hlines(0, -1, 6, color='grey')
In [12]:
ax = car.boxplot(column='Yards', by='GameDate')
ax.set_title("Carolina offence Yardage by game")
Out[12]:
Let's have a look at the same thing using Seaborn. We'll fix the x axis tick labels too, rotating them.
In [13]:
g = sns.boxplot(data=car, y='Yards', x='GameDate')
for item in g.get_xticklabels(): item.set_rotation(60)
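As a reminder of what each box encodes, here is a rough sketch of the numbers behind one box: the quartiles, whiskers reaching to the furthest points within 1.5 times the interquartile range, and anything beyond that drawn as an outlier. The yardage values are made up, and the quartile interpolation used by the standard library may differ slightly from what the plotting library computes, so treat this as an approximation.

```python
import statistics


def box_summary(values, whisker=1.5):
    """Compute quartiles, whisker ends and outliers for one box of a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # furthest point still inside the lower fence
        "whisker_high": max(inside),   # furthest point still inside the upper fence
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Hypothetical yardage for a single game, for illustration only.
yards = [-3, 0, 2, 4, 5, 7, 9, 12, 45]
summary = box_summary(yards)
print(summary)
```

With these numbers the 45-yard play ends up as the only outlier, which is the kind of single dot you see floating above a box in the plots.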
Finally, let's look at one more way to show the distribution of the data: Seaborn's violin plot.
In [14]:
g = sns.violinplot(data=car, x='Yards', y='GameDate', orient='h')
g.vlines(0,-1,15, alpha=0.5)
Out[14]:
We've barely touched on strip plots, box plots and violin plots. It's your turn to go on and explore. As for the data, we've looked at every single event, play and non-play (false starts etc.), penalties, touchdowns and so on, all on an equal footing. To gain better insight into the data, we'd have to look at these things individually, assign weights, etc.
If I get enough demand, I'll cover this subject in more detail in the future.