In [1]:
from ggplot import *
import pandas as pd
import numpy as np
In [ ]:
%matplotlib inline
In [3]:
df = pd.read_csv("./baseball-pitches-clean.csv")
df = df[['pitch_time', 'inning', 'pitcher_name', 'hitter_name', 'pitch_type',
         'px', 'pz', 'pitch_name', 'start_speed', 'end_speed', 'type_confidence']]
df.head()
Out[3]:
px and pz are the coordinates of a pitch as it crosses home plate. Let's plot these and see if our data makes sense.
In [3]:
ggplot(df, aes(x='px', y='pz')) + geom_point()
Out[3]:
What about the pitch speed?
In [4]:
ggplot(aes(x='start_speed', y='end_speed'), data=df) + geom_point()
Out[4]:
A better way to inspect pitch speed might be to look at a distribution of the data.
Does this make sense? Let's consult the source of all true wisdom: https://answers.yahoo.com/question/index?qid=20080126131031AAwVCNk
In [4]:
ggplot(df, aes(x='start_speed')) + geom_histogram()
Out[4]:
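To make concrete what the histogram above is counting, here is a minimal sketch of fixed-width binning in plain Python. The speeds are made-up values for illustration, not rows from the CSV, and `bin_counts` is a hypothetical helper; geom_histogram chooses its own bins, so treat this as the idea rather than the library's implementation.

```python
from collections import Counter


def bin_counts(values, width):
    """Count values per fixed-width bin, keyed by each bin's lower edge."""
    counts = Counter()
    for v in values:
        lower = (v // width) * width  # lower edge of the bin containing v
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Made-up pitch speeds in mph (assumption: illustrative only).
speeds = [78.2, 84.9, 85.1, 90.3, 91.7, 92.4, 93.0, 94.8]
hist = bin_counts(speeds, width=5)

# Print a crude text histogram, one row per bin.
for lower, n in hist.items():
    print("%5.1f-%5.1f mph | %s" % (lower, lower + 5, "#" * n))
```

Each bar in the plot corresponds to one of these counts, which is why a histogram shows where speeds cluster far more clearly than a scatter of start speed against end speed.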
What about for specific pitches?
In [5]:
for name, frame in df.groupby("pitch_name"):
    # print() renders each plot; the call form works on both Python 2 and 3
    print(ggplot(aes(x='start_speed'), data=frame) + geom_histogram() + ggtitle("Distribution of " + str(name)))
That was helpful but I'm sort of on plot overload now.
facet_wrap FTW
Use the trellis.
"Trellis Graphics is a family of techniques for viewing complex, multi-variable data sets." Read more here.
In [6]:
ggplot(aes(x='start_speed'), data=df) +\
    geom_histogram() +\
    facet_wrap('pitch_name')
Out[6]:
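Conceptually, faceting splits the data by a grouping variable and draws the same plot once per group. The sketch below shows only the splitting step in plain Python, printing a one-line summary per panel instead of drawing. The pitch names and speeds are made up for illustration and are not taken from the CSV.

```python
from collections import defaultdict

# Made-up (pitch_name, start_speed) pairs (assumption: illustrative only).
pitches = [
    ("Fastball", 93.1), ("Curveball", 78.4), ("Fastball", 94.6),
    ("Slider", 85.0), ("Curveball", 77.2), ("Fastball", 92.3),
]

# One "panel" per pitch name, holding that group's speeds.
panels = defaultdict(list)
for name, speed in pitches:
    panels[name].append(speed)

# Summarize each panel: (count, slowest, fastest).
summary = {name: (len(s), min(s), max(s)) for name, s in panels.items()}
for name in sorted(summary):
    n, slowest, fastest = summary[name]
    print("%-10s n=%d  range=%.1f-%.1f mph" % (name, n, slowest, fastest))
```

This is the same grouping the earlier `df.groupby("pitch_name")` loop performs; facet_wrap simply arranges the per-group plots in one figure.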
In [15]:
from IPython.display import YouTubeVideo
YouTubeVideo("ikLlRT2j7EQ")
Out[15]:
OK, so what about balls and strikes?
In [8]:
ggplot(aes(x='pitch_type'), data=df) + geom_bar()
Out[8]:
In [9]:
ggplot(aes(x='start_speed'), data=df) +\
    geom_histogram() +\
    facet_grid('pitch_type')
Out[9]:
In [12]:
ggplot(aes(x='start_speed'), data=df) +\
    geom_histogram() +\
    facet_grid('pitch_name', 'pitch_type', scales="free")
Out[12]:
In [13]:
ggplot(df, aes(x='start_speed')) +\
    geom_density()
Out[13]:
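A density plot is usually a kernel density estimate: a smooth curve built by placing a small bump on every observation and summing the bumps. The sketch below is a minimal Gaussian version in plain Python. The speeds and the bandwidth of 2.0 are made-up assumptions, `gaussian_kde_sketch` is a hypothetical helper, and the library may choose its bandwidth differently.

```python
import math


def gaussian_kde_sketch(samples, bandwidth):
    """Return a function estimating the density with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, evaluated at x.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up pitch speeds in mph (assumption: illustrative only).
speeds = [78.2, 84.9, 85.1, 90.3, 91.7, 92.4, 93.0, 94.8]
density = gaussian_kde_sketch(speeds, bandwidth=2.0)

# Sanity check: a density should integrate to roughly 1 (Riemann sum, 50-120 mph).
step = 0.1
total = sum(density(50 + i * step) * step for i in range(700))
print("area under curve: %.3f" % total)
```

Unlike a histogram, the curve does not depend on where bin edges fall, which makes overlaid densities for several pitch types easier to compare.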
In [14]:
ggplot(df, aes(x='start_speed', color='pitch_name')) +\
    geom_density()
Out[14]: