The objective here to generate four box-and-whisker graphs:
In [1]:
import pandas as pd
from pylab import *
%matplotlib inline
import seaborn as sns
import os.path
sns.set()
sns.reset_orig() # want to use matplotlib style, not seaborn style
import scrapenhl2.scrape.team_info as team_info
import scrapenhl2.scrape.schedules as schedules
import scrapenhl2.scrape.teams as teams
import scrapenhl2.manipulate.manipulate as manip
To aggregate data, loop over both seasons. In each season, loop over each team. Read in its season PBP, filter for Corsi events, group by game and acting team and sum. Then, go long to wide, so each row is a game.
Aggregate over teams, add a flag for whether the team is the Caps--we need this for the graph below--and we're good to go.
In [2]:
dflst = {2016: [], 2017: []}
for season in dflst.keys():
for team in schedules.get_teams_in_season(season):
if not os.path.exists(teams.get_team_pbp_filename(season, team)):
continue
df = teams.get_team_pbp(season, team)
df = manip.filter_for_corsi(df)
df.loc[:, 'Team2'] = df.Team.apply(lambda x: 'CF' if x == team else 'CA')
df = df[['Team2', 'Game']] \
.query("Game <= 21230") \
.assign(Count=1) \
.groupby(['Team2', 'Game']) \
.count() \
.pivot_table(index='Game', columns='Team2', values='Count') \
.reset_index() \
.assign(Team=team_info.team_as_str(team), Season=season)
dflst[season].append(df)
df = pd.concat(dflst[season])
df.loc[:, 'IsWSH'] = df.Team.apply(lambda x: x == 'WSH')
dflst[season] = df
print(dflst[2017].head())
We want to order teams in each graph by the median. That means we'll be comaparing horizontal positions of the Caps' box-and-whisker--meaning we should arrange 2016-17 on top and 2017-18 on the bottom.
To get all this to work I use a little trickery with modulus and integer division. But it's straightforward to brute-force this as well (it's only four axes).
In [3]:
fig, axes = subplots(2, 2, sharey=True, figsize=[12, 12])
axes = axes.flatten()
xticks(rotation=45)
for i in range(4):
season = 2016 + i // 2
df = dflst[season]
if i % 2 == 0:
order = df[['Team', 'CF']].groupby('Team').median().sort_values('CF', ascending=False).index
sns.boxplot(x='Team', y='CF', hue='IsWSH', order=order, data=df, ax=axes[i])
else:
order = df[['Team', 'CA']].groupby('Team').median().sort_values('CA').index
sns.boxplot(x='Team', y='CA', hue='IsWSH', order=order, data=df, ax=axes[i])
if i % 2 == 0:
axes[i].set_title('CF by game, {0:d}-{1:s}'.format(season, str(season + 1)[2:]))
else:
axes[i].set_title('CA by game, {0:d}-{1:s}'.format(season, str(season + 1)[2:]))
setp(axes[i].xaxis.get_majorticklabels(), rotation=45)
fig.tight_layout()