Lab 1: Exploring NFL Play-By-Play Data

Data Loading and Preprocessing

To begin, we load the data into a Pandas data frame from a csv file.



In [3]:

    
import pandas as pd
import numpy as np

df = pd.read_csv('data/data.csv') # read in the csv file









    



//anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (26) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Let's take a cursory glance at the data to see what we're working with.



In [4]:

    
df.head()









    Out[4]:






  
    
      
      Unnamed: 0
      Date
      GameID
      Drive
      qtr
      down
      time
      TimeUnder
      TimeSecs
      PlayTimeDiff
      ...
      Accepted.Penalty
      PenalizedTeam
      PenaltyType
      PenalizedPlayer
      Penalty.Yards
      PosTeamScore
      DefTeamScore
      ScoreDiff
      AbsScoreDiff
      Season
    
  
  
    
      0
      36
      2015-09-10
      2015091000
      1
      1
      NaN
      15:00
      15.0
      3600.0
      0.0
      ...
      0
      NaN
      NaN
      NaN
      0
      0.0
      0.0
      0.0
      0.0
      2015
    
    
      1
      51
      2015-09-10
      2015091000
      1
      1
      1.0
      15:00
      15.0
      3600.0
      0.0
      ...
      0
      NaN
      NaN
      NaN
      0
      0.0
      0.0
      0.0
      0.0
      2015
    
    
      2
      72
      2015-09-10
      2015091000
      1
      1
      1.0
      14:21
      15.0
      3561.0
      39.0
      ...
      0
      NaN
      NaN
      NaN
      0
      0.0
      0.0
      0.0
      0.0
      2015
    
    
      3
      101
      2015-09-10
      2015091000
      1
      1
      2.0
      14:04
      15.0
      3544.0
      17.0
      ...
      0
      NaN
      NaN
      NaN
      0
      0.0
      0.0
      0.0
      0.0
      2015
    
    
      4
      122
      2015-09-10
      2015091000
      1
      1
      1.0
      13:26
      14.0
      3506.0
      38.0
      ...
      0
      NaN
      NaN
      NaN
      0
      0.0
      0.0
      0.0
      0.0
      2015
    
  

5 rows × 64 columns

There's a lot of data that we don't care about. For example, 'PassAttempt' is a binary attribute, but there's also an attribute called 'PlayType' which is set to 'Pass' for a passing play.

We define a list of the columns which we're not interested in, and then we delete them



In [5]:

    
columns_to_delete = ['Unnamed: 0', 'Date', 'time', 'TimeUnder', 
                     'PosTeamScore', 'PassAttempt', 'RushAttempt', 
                     'DefTeamScore', 'Season', 'PlayAttempted']

#Iterate through and delete the columns we don't want
for col in columns_to_delete:
    if col in df:
        del df[col]

We can then grab a list of the remaining column names



In [6]:

    
df.columns









    Out[6]:





Index(['GameID', 'Drive', 'qtr', 'down', 'TimeSecs', 'PlayTimeDiff',
       'SideofField', 'yrdln', 'yrdline100', 'ydstogo', 'ydsnet', 'GoalToGo',
       'FirstDown', 'posteam', 'DefensiveTeam', 'desc', 'Yards.Gained', 'sp',
       'Touchdown', 'ExPointResult', 'TwoPointConv', 'DefTwoPoint', 'Safety',
       'PlayType', 'Passer', 'PassOutcome', 'PassLength', 'PassLocation',
       'InterceptionThrown', 'Interceptor', 'Rusher', 'RunLocation', 'RunGap',
       'Receiver', 'Reception', 'ReturnResult', 'Returner', 'Tackler1',
       'Tackler2', 'FieldGoalResult', 'FieldGoalDistance', 'Fumble',
       'RecFumbTeam', 'RecFumbPlayer', 'Sack', 'Challenge.Replay',
       'ChalReplayResult', 'Accepted.Penalty', 'PenalizedTeam', 'PenaltyType',
       'PenalizedPlayer', 'Penalty.Yards', 'ScoreDiff', 'AbsScoreDiff'],
      dtype='object')

Temporary simple data replacement so that we can cast to integers (instead of objects)



In [7]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46129 entries, 0 to 46128
Data columns (total 54 columns):
GameID                46129 non-null int64
Drive                 46129 non-null int64
qtr                   46129 non-null int64
down                  39006 non-null float64
TimeSecs              46102 non-null float64
PlayTimeDiff          46075 non-null float64
SideofField           46063 non-null object
yrdln                 46021 non-null float64
yrdline100            46021 non-null float64
ydstogo               46129 non-null int64
ydsnet                46129 non-null int64
GoalToGo              46021 non-null float64
FirstDown             42811 non-null float64
posteam               42878 non-null object
DefensiveTeam         42878 non-null object
desc                  46129 non-null object
Yards.Gained          46129 non-null int64
sp                    46129 non-null int64
Touchdown             46129 non-null int64
ExPointResult         1131 non-null object
TwoPointConv          89 non-null object
DefTwoPoint           5 non-null object
Safety                46129 non-null int64
PlayType              46129 non-null object
Passer                19398 non-null object
PassOutcome           19435 non-null object
PassLength            19291 non-null object
PassLocation          19291 non-null object
InterceptionThrown    46129 non-null int64
Interceptor           467 non-null object
Rusher                13080 non-null object
RunLocation           12969 non-null object
RunGap                9588 non-null object
Receiver              18458 non-null object
Reception             46129 non-null int64
ReturnResult          2340 non-null object
Returner              2490 non-null object
Tackler1              24903 non-null object
Tackler2              3356 non-null object
FieldGoalResult       1001 non-null object
FieldGoalDistance     989 non-null float64
Fumble                46129 non-null int64
RecFumbTeam           481 non-null object
RecFumbPlayer         481 non-null object
Sack                  46129 non-null int64
Challenge.Replay      46129 non-null int64
ChalReplayResult      413 non-null object
Accepted.Penalty      46129 non-null int64
PenalizedTeam         3535 non-null object
PenaltyType           1952 non-null object
PenalizedPlayer       3404 non-null object
Penalty.Yards         46129 non-null int64
ScoreDiff             42878 non-null float64
AbsScoreDiff          42878 non-null float64
dtypes: float64(10), int64(16), object(28)
memory usage: 19.0+ MB



In [8]:

    
df = df.replace(to_replace=np.nan,value=-1)

At this point, lots of things are encoded as objects, or with excesively large data types



In [9]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46129 entries, 0 to 46128
Data columns (total 54 columns):
GameID                46129 non-null int64
Drive                 46129 non-null int64
qtr                   46129 non-null int64
down                  46129 non-null float64
TimeSecs              46129 non-null float64
PlayTimeDiff          46129 non-null float64
SideofField           46129 non-null object
yrdln                 46129 non-null float64
yrdline100            46129 non-null float64
ydstogo               46129 non-null int64
ydsnet                46129 non-null int64
GoalToGo              46129 non-null float64
FirstDown             46129 non-null float64
posteam               46129 non-null object
DefensiveTeam         46129 non-null object
desc                  46129 non-null object
Yards.Gained          46129 non-null int64
sp                    46129 non-null int64
Touchdown             46129 non-null int64
ExPointResult         46129 non-null object
TwoPointConv          46129 non-null object
DefTwoPoint           46129 non-null object
Safety                46129 non-null int64
PlayType              46129 non-null object
Passer                46129 non-null object
PassOutcome           46129 non-null object
PassLength            46129 non-null object
PassLocation          46129 non-null object
InterceptionThrown    46129 non-null int64
Interceptor           46129 non-null object
Rusher                46129 non-null object
RunLocation           46129 non-null object
RunGap                46129 non-null object
Receiver              46129 non-null object
Reception             46129 non-null int64
ReturnResult          46129 non-null object
Returner              46129 non-null object
Tackler1              46129 non-null object
Tackler2              46129 non-null object
FieldGoalResult       46129 non-null object
FieldGoalDistance     46129 non-null float64
Fumble                46129 non-null int64
RecFumbTeam           46129 non-null object
RecFumbPlayer         46129 non-null object
Sack                  46129 non-null int64
Challenge.Replay      46129 non-null int64
ChalReplayResult      46129 non-null object
Accepted.Penalty      46129 non-null int64
PenalizedTeam         46129 non-null object
PenaltyType           46129 non-null object
PenalizedPlayer       46129 non-null object
Penalty.Yards         46129 non-null int64
ScoreDiff             46129 non-null float64
AbsScoreDiff          46129 non-null float64
dtypes: float64(10), int64(16), object(28)
memory usage: 19.0+ MB

We define four lists based on the types of features we're using. Binary features are separated from the other categorical features so that they can be stored in less space



In [10]:

    
continuous_features = ['TimeSecs', 'PlayTimeDiff', 'yrdln', 'yrdline100',
                       'ydstogo', 'ydsnet', 'Yards.Gained', 'Penalty.Yards',
                       'ScoreDiff', 'AbsScoreDiff']

ordinal_features = ['Drive', 'qtr', 'down']
binary_features = ['GoalToGo', 'FirstDown','sp', 'Touchdown', 'Safety', 'Fumble']
categorical_features = df.columns.difference(continuous_features).difference(ordinal_features)

We then cast all of the columns to the appropriate underlying data types



In [11]:

    
df[continuous_features] = df[continuous_features].astype(np.float64)
df[ordinal_features] = df[ordinal_features].astype(np.int64)
df[binary_features] = df[binary_features].astype(np.int8)

THIS IS SOME MORE REFORMATTING SHIT I'M DOING FOR NOW. PROLLY GONNA KEEP IT



In [12]:

    
df['PassOutcome'].replace(['Complete', 'Incomplete Pass'], [1, 0], inplace=True)



In [13]:

    
df = df[df["PlayType"] != 'Quarter End']
df = df[df["PlayType"] != 'Two Minute Warning']
df = df[df["PlayType"] != 'End of Game']

Now all of the objects are encoded the way we'd like them to be



In [14]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 44762 entries, 0 to 46128
Data columns (total 54 columns):
GameID                44762 non-null int64
Drive                 44762 non-null int64
qtr                   44762 non-null int64
down                  44762 non-null int64
TimeSecs              44762 non-null float64
PlayTimeDiff          44762 non-null float64
SideofField           44762 non-null object
yrdln                 44762 non-null float64
yrdline100            44762 non-null float64
ydstogo               44762 non-null float64
ydsnet                44762 non-null float64
GoalToGo              44762 non-null int8
FirstDown             44762 non-null int8
posteam               44762 non-null object
DefensiveTeam         44762 non-null object
desc                  44762 non-null object
Yards.Gained          44762 non-null float64
sp                    44762 non-null int8
Touchdown             44762 non-null int8
ExPointResult         44762 non-null object
TwoPointConv          44762 non-null object
DefTwoPoint           44762 non-null object
Safety                44762 non-null int8
PlayType              44762 non-null object
Passer                44762 non-null object
PassOutcome           44762 non-null int64
PassLength            44762 non-null object
PassLocation          44762 non-null object
InterceptionThrown    44762 non-null int64
Interceptor           44762 non-null object
Rusher                44762 non-null object
RunLocation           44762 non-null object
RunGap                44762 non-null object
Receiver              44762 non-null object
Reception             44762 non-null int64
ReturnResult          44762 non-null object
Returner              44762 non-null object
Tackler1              44762 non-null object
Tackler2              44762 non-null object
FieldGoalResult       44762 non-null object
FieldGoalDistance     44762 non-null float64
Fumble                44762 non-null int8
RecFumbTeam           44762 non-null object
RecFumbPlayer         44762 non-null object
Sack                  44762 non-null int64
Challenge.Replay      44762 non-null int64
ChalReplayResult      44762 non-null object
Accepted.Penalty      44762 non-null int64
PenalizedTeam         44762 non-null object
PenaltyType           44762 non-null object
PenalizedPlayer       44762 non-null object
Penalty.Yards         44762 non-null float64
ScoreDiff             44762 non-null float64
AbsScoreDiff          44762 non-null float64
dtypes: float64(11), int64(10), int8(6), object(27)
memory usage: 17.0+ MB

Now we can start to take a look at what's in each of our columns



In [15]:

    
df.describe()









    Out[15]:






  
    
      
      GameID
      Drive
      qtr
      down
      TimeSecs
      PlayTimeDiff
      yrdln
      yrdline100
      ydstogo
      ydsnet
      ...
      InterceptionThrown
      Reception
      FieldGoalDistance
      Fumble
      Sack
      Challenge.Replay
      Accepted.Penalty
      Penalty.Yards
      ScoreDiff
      AbsScoreDiff
    
  
  
    
      count
      4.476200e+04
      44762.000000
      44762.000000
      44762.000000
      44762.000000
      44762.000000
      44762.000000
      44762.000000
      44762.000000
      44762.000000
      ...
      44762.000000
      44762.000000
      44762.000000
      44762.000000
      44762.000000
      44762.000000
      44762.000000
      44762.000000
      44762.000000
      44762.000000
    
    
      mean
      2.015164e+09
      12.194920
      2.580135
      1.611568
      1699.955342
      20.229860
      28.521827
      49.954381
      7.533399
      26.854207
      ...
      0.010455
      0.267571
      -0.134623
      0.014007
      0.028238
      0.009227
      0.078973
      0.672311
      -1.059336
      7.599929
    
    
      std
      2.181743e+05
      7.132299
      1.134654
      1.372797
      1064.910674
      17.735978
      12.630670
      24.916942
      4.824107
      25.428661
      ...
      0.101716
      0.442697
      5.958572
      0.117522
      0.165655
      0.095612
      0.269700
      2.755569
      10.713988
      7.625716
    
    
      min
      2.015091e+09
      1.000000
      1.000000
      -1.000000
      -747.000000
      -1.000000
      -1.000000
      -1.000000
      0.000000
      -48.000000
      ...
      0.000000
      0.000000
      -1.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      -41.000000
      -1.000000
    
    
      25%
      2.015101e+09
      6.000000
      2.000000
      1.000000
      771.000000
      5.000000
      20.000000
      32.000000
      4.000000
      5.000000
      ...
      0.000000
      0.000000
      -1.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      -7.000000
      2.000000
    
    
      50%
      2.015111e+09
      12.000000
      3.000000
      2.000000
      1800.000000
      16.000000
      30.000000
      51.000000
      10.000000
      20.000000
      ...
      0.000000
      0.000000
      -1.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      6.000000
    
    
      75%
      2.015121e+09
      18.000000
      4.000000
      3.000000
      2589.000000
      37.000000
      38.000000
      72.000000
      10.000000
      45.000000
      ...
      0.000000
      1.000000
      -1.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      4.000000
      11.000000
    
    
      max
      2.016010e+09
      33.000000
      5.000000
      4.000000
      3600.000000
      940.000000
      50.000000
      99.000000
      42.000000
      99.000000
      ...
      1.000000
      1.000000
      66.000000
      1.000000
      1.000000
      1.000000
      1.000000
      55.000000
      41.000000
      41.000000
    
  

8 rows × 27 columns



In [16]:

    
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
#Embed figures in the Jupyter Notebook
%matplotlib inline

#Use GGPlot style for matplotlib
plt.style.use('ggplot')



In [17]:

    
pass_plays = df[df['PlayType'] == "Pass"]
pass_plays_grouped = pass_plays.groupby(by=['Passer'])

Look at the number of yards gained by a FirstDown



In [18]:

    
first_downs_grouped = df.groupby(by=['FirstDown'])

print(first_downs_grouped['Yards.Gained'].count())
print("-----------------------------")
print(first_downs_grouped['Yards.Gained'].sum())
print("-----------------------------")
print(first_downs_grouped['Yards.Gained'].sum()/first_downs_grouped['Yards.Gained'].count())









    



FirstDown
-1     3007
 0    29370
 1    12385
Name: Yards.Gained, dtype: int64
-----------------------------
FirstDown
-1     24201.0
 0     81642.0
 1    119522.0
Name: Yards.Gained, dtype: float64
-----------------------------
FirstDown
-1    8.048221
 0    2.779775
 1    9.650545
Name: Yards.Gained, dtype: float64

Group by play type



In [19]:

    
plays_grouped = df.groupby(by=['PlayType'])
print(plays_grouped['Yards.Gained'].count())
print("-----------------------------")
print(plays_grouped['Yards.Gained'].sum())
print("-----------------------------")
print(plays_grouped['Yards.Gained'].sum()/plays_grouped['Yards.Gained'].count())









    



PlayType
Extra Point     1126
Field Goal       988
Kickoff         2565
No Play         2608
Onside Kick       67
Pass           18323
Punt            2429
QB Kneel         425
Run            13129
Sack            1191
Spike             52
Timeout         1859
Name: Yards.Gained, dtype: int64
-----------------------------
PlayType
Extra Point         0.0
Field Goal        197.0
Kickoff         25217.0
No Play          6370.0
Onside Kick        49.0
Pass           133854.0
Punt            11364.0
QB Kneel         -453.0
Run             56627.0
Sack            -7868.0
Spike               0.0
Timeout             8.0
Name: Yards.Gained, dtype: float64
-----------------------------
PlayType
Extra Point    0.000000
Field Goal     0.199393
Kickoff        9.831189
No Play        2.442485
Onside Kick    0.731343
Pass           7.305245
Punt           4.678469
QB Kneel      -1.065882
Run            4.313124
Sack          -6.606213
Spike          0.000000
Timeout        0.004303
Name: Yards.Gained, dtype: float64

We can eliminate combos who didn't have at least 10 receptions together, and then re-sample the data. This will remove noise from QB-receiver combos who have very high or low completion rates because they've played very little together.



In [20]:

    
size = 10
corr = df.corr()
fig, ax = plt.subplots(figsize=(size, size))
ax.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns)
for tick in ax.get_xticklabels():
    tick.set_rotation(90)
plt.yticks(range(len(corr.columns)), corr.columns)









    Out[20]:





([<matplotlib.axis.YTick at 0x116b0d668>,
  <matplotlib.axis.YTick at 0x117adda20>,
  <matplotlib.axis.YTick at 0x116b0d438>,
  <matplotlib.axis.YTick at 0x117fefc50>,
  <matplotlib.axis.YTick at 0x117fe26a0>,
  <matplotlib.axis.YTick at 0x117fd5668>,
  <matplotlib.axis.YTick at 0x117fbbda0>,
  <matplotlib.axis.YTick at 0x117ffe8d0>,
  <matplotlib.axis.YTick at 0x117ff64e0>,
  <matplotlib.axis.YTick at 0x117fecdd8>,
  <matplotlib.axis.YTick at 0x117fe2630>,
  <matplotlib.axis.YTick at 0x117fdb240>,
  <matplotlib.axis.YTick at 0x117fcfb38>,
  <matplotlib.axis.YTick at 0x117fc7390>,
  <matplotlib.axis.YTick at 0x116afc470>,
  <matplotlib.axis.YTick at 0x117ad1438>,
  <matplotlib.axis.YTick at 0x118003ac8>,
  <matplotlib.axis.YTick at 0x11800e400>,
  <matplotlib.axis.YTick at 0x11800ee48>,
  <matplotlib.axis.YTick at 0x1180118d0>,
  <matplotlib.axis.YTick at 0x118013358>,
  <matplotlib.axis.YTick at 0x118013da0>,
  <matplotlib.axis.YTick at 0x118015828>,
  <matplotlib.axis.YTick at 0x1180172b0>,
  <matplotlib.axis.YTick at 0x118017cf8>,
  <matplotlib.axis.YTick at 0x11801a780>,
  <matplotlib.axis.YTick at 0x118020208>],
 <a list of 27 Text yticklabel objects>)

We can also extract the highest-completion percentage combos. Here we take the top-10 most reliable QB-receiver pairs.



In [21]:

    
import seaborn as sns
%matplotlib inline



In [22]:

    
# df_dropped = df.dropna()
# df_dropped.info()
selected_types = df.select_dtypes(exclude=["object"])

useful_attributes = df[['FieldGoalDistance','ydstogo']]
print(useful_attributes)









    



       FieldGoalDistance  ydstogo
0                   -1.0      0.0
1                   -1.0     10.0
2                   -1.0     10.0
3                   -1.0      1.0
4                   -1.0     10.0
5                   -1.0     10.0
6                   -1.0     10.0
7                   -1.0     18.0
8                   -1.0     28.0
9                   -1.0     22.0
10                  44.0     12.0
11                  -1.0     10.0
12                  -1.0     10.0
13                  -1.0     10.0
14                  -1.0     10.0
15                  -1.0     10.0
16                  -1.0     10.0
17                  -1.0     10.0
18                  -1.0     10.0
19                  -1.0      4.0
20                  -1.0      5.0
21                  -1.0     10.0
22                  -1.0     15.0
23                  -1.0     12.0
24                  -1.0     18.0
25                  -1.0      1.0
26                  -1.0     10.0
27                  -1.0      2.0
28                  -1.0      1.0
29                  -1.0     10.0
...                  ...      ...
46098               -1.0      4.0
46099               -1.0     10.0
46100               -1.0     10.0
46101               -1.0      7.0
46102               -1.0      3.0
46103               43.0     10.0
46104               -1.0      0.0
46105               -1.0     10.0
46106               -1.0     10.0
46107               -1.0     10.0
46108               -1.0     10.0
46109               -1.0     10.0
46110               -1.0     10.0
46111               -1.0     13.0
46112               -1.0      0.0
46113               -1.0     13.0
46114               -1.0     10.0
46116               -1.0     10.0
46117               -1.0      0.0
46118               -1.0      5.0
46119               -1.0      0.0
46120               -1.0      3.0
46121               -1.0     10.0
46122               -1.0     10.0
46123               -1.0     10.0
46124               -1.0     10.0
46125               -1.0     10.0
46126               -1.0      3.0
46127               -1.0      3.0
46128               -1.0      2.0

[44762 rows x 2 columns]



In [ ]:



In [23]:

    
sns.heatmap(corr)









    Out[23]:





<matplotlib.axes._subplots.AxesSubplot at 0x1248217f0>



In [24]:

    
cluster_corr = sns.clustermap(corr)
plt.setp(cluster_corr.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
# plt.xticks(rotation=90)









    Out[24]:





[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]



In [25]:

    
fg_analysis = df[['FieldGoalDistance','FieldGoalResult', 'PlayType']]

fg_analysis = fg_analysis[fg_analysis['FieldGoalResult'] != -1]
fg_grouped = fg_analysis.groupby(by=["FieldGoalResult"])
print(fg_grouped.sum()/fg_grouped.count())


sns.violinplot(x="FieldGoalResult", y="FieldGoalDistance", data=fg_analysis, inner="quart")









    



                 FieldGoalDistance  PlayType
FieldGoalResult                             
Blocked                  42.739130       NaN
Good                     36.108747       NaN
No Good                  47.000000       NaN






    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x124b44cf8>



In [26]:

    
fg_analysis = fg_analysis[fg_analysis['FieldGoalResult'] != "Blocked"]
fg_analysis = fg_analysis[fg_analysis['PlayType'] == "Field Goal"]
sns.violinplot(x = "PlayType", y="FieldGoalDistance",  hue="FieldGoalResult", data=fg_analysis, inner="quart", split = True)









    Out[26]:





<matplotlib.axes._subplots.AxesSubplot at 0x124ef45c0>



In [27]:

    
pass_analysis = df[df.PlayType == 'Pass']
pass_analysis = pass_analysis[['PassOutcome','PassLength','PassLocation']]
# print(pass_analysis)


pass_analysis = pass_analysis[pass_analysis.PassLength != -1]
pa_grouped = pass_analysis.groupby(by=['PassLength'])
print(pa_grouped.count())
# pass_analysis['SuccessfulPass'] = pd.cut(df.PassOutcome,[0,1,2],2,labels=['Complete','Incomplete'])

pass_analysis.info()

# Draw a nested violinplot and split the violins for easier comparison
# sns.violinplot(x="PassLocation", y="SuccessfulPass", hue="PassLength", data=pass_analysis, split=True,
#                inner="quart")
# sns.despine(left=True)


pass_info = pd.crosstab([pass_analysis['PassLength'],pass_analysis['PassLocation'] ], 
                       pass_analysis.PassOutcome.astype(bool))
print(pass_info)

pass_info.plot(kind='bar', stacked=True)









    



            PassOutcome  PassLocation
PassLength                           
Deep               3431          3431
Short             14762         14762
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18193 entries, 2 to 46128
Data columns (total 3 columns):
PassOutcome     18193 non-null int64
PassLength      18193 non-null object
PassLocation    18193 non-null object
dtypes: int64(1), object(2)
memory usage: 568.5+ KB
PassOutcome              False  True 
PassLength PassLocation              
Deep       left            847    520
           middle          363    362
           right           848    491
Short      left           1673   3778
           middle         1005   2423
           right          1974   3909






    Out[27]:





<matplotlib.axes._subplots.AxesSubplot at 0x124b4dda0>



In [28]:

    
df.RunGap.value_counts()









    Out[28]:





-1        35174
end        3316
guard      3169
tackle     3103
Name: RunGap, dtype: int64



In [29]:

    
pass_rate = pass_info.div(pass_info.sum(1).astype(float),
                             axis=0) # normalize the value

# print pass_rate
pass_rate.plot(kind='barh', 
                   stacked=True)









    Out[29]:





<matplotlib.axes._subplots.AxesSubplot at 0x1257fd278>



In [30]:

    
# Run data



In [31]:

    
run_analysis = df[df.PlayType == 'Run']
run_analysis = run_analysis[['Yards.Gained','RunGap','RunLocation']]


runlocation_violinplot = sns.violinplot(x="RunLocation", y="Yards.Gained", data=run_analysis, inner="quart")

run_analysis = run_analysis[run_analysis.RunLocation != -1]
run_analysis['RunGap'].replace(-1, 'up the middle',inplace=True)
# run_analysis['RunLocation'].replace(-1, 'no location',inplace=True)


ra_grouped = run_analysis.groupby(by=['RunGap'])

print(ra_grouped.count())
print(run_analysis.info())




sns.set(style="whitegrid", palette="muted")

# Draw a categorical scatterplot to show each observation
sns.factorplot(x="RunLocation", y="Yards.Gained", hue="RunGap", data=run_analysis)
sns.factorplot(x="RunLocation", y="Yards.Gained", hue="RunGap", data=run_analysis,kind="bar")
sns.factorplot(x="RunLocation", y="Yards.Gained", hue="RunGap", data=run_analysis,kind="violin")









    



               Yards.Gained  RunLocation
RunGap                                  
end                    3311         3311
guard                  3165         3165
tackle                 3101         3101
up the middle          3378         3378
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12955 entries, 1 to 46118
Data columns (total 3 columns):
Yards.Gained    12955 non-null float64
RunGap          12955 non-null object
RunLocation     12955 non-null object
dtypes: float64(1), object(2)
memory usage: 404.8+ KB
None






    Out[31]:





<seaborn.axisgrid.FacetGrid at 0x11af53668>



In [32]:

    
#just compare left and right options

run_lr = run_analysis[(run_analysis['RunLocation'] == 'right') | (run_analysis['RunLocation'] == 'left')]
sns.factorplot(x="RunLocation", y="Yards.Gained", hue="RunGap", data=run_lr,kind="bar")









    Out[32]:





<seaborn.axisgrid.FacetGrid at 0x12589e6a0>



In [33]:

    
rungap_violinplot = sns.violinplot(x="RunGap", y="Yards.Gained", data=run_analysis, inner="quart")



In [ ]:

	Unnamed: 0	Date	GameID	Drive	qtr	down	time	TimeUnder	TimeSecs	PlayTimeDiff	...	PenalizedTeam	PenaltyType	PenalizedPlayer	Season
0	36	2015-09-10	2015091000	1	1	NaN	15:00	15.0	3600.0	0.0	...	NaN	NaN	NaN	2015
1	51	2015-09-10	2015091000	1	1	1.0	15:00	15.0	3600.0	0.0	...	NaN	NaN	NaN	2015
2	72	2015-09-10	2015091000	1	1	1.0	14:21	15.0	3561.0	39.0	...	NaN	NaN	NaN	2015
3	101	2015-09-10	2015091000	1	1	2.0	14:04	15.0	3544.0	17.0	...	NaN	NaN	NaN	2015
4	122	2015-09-10	2015091000	1	1	1.0	13:26	14.0	3506.0	38.0	...	NaN	NaN	NaN	2015

	GameID	Drive	qtr	down	TimeSecs	PlayTimeDiff	yrdln	yrdline100	ydstogo	ydsnet	...	InterceptionThrown	Reception	FieldGoalDistance	Fumble	Sack	Challenge.Replay	Accepted.Penalty	Penalty.Yards	ScoreDiff	AbsScoreDiff
count	4.476200e+04	44762.000000	44762.000000	44762.000000	44762.000000	44762.000000	44762.000000	44762.000000	44762.000000	44762.000000	...	44762.000000	44762.000000	44762.000000	44762.000000	44762.000000	44762.000000	44762.000000	44762.000000	44762.000000	44762.000000
mean	2.015164e+09	12.194920	2.580135	1.611568	1699.955342	20.229860	28.521827	49.954381	7.533399	26.854207	...	0.010455	0.267571	-0.134623	0.014007	0.028238	0.009227	0.078973	0.672311	-1.059336	7.599929
std	2.181743e+05	7.132299	1.134654	1.372797	1064.910674	17.735978	12.630670	24.916942	4.824107	25.428661	...	0.101716	0.442697	5.958572	0.117522	0.165655	0.095612	0.269700	2.755569	10.713988	7.625716
min	2.015091e+09	1.000000	1.000000	-1.000000	-747.000000	-1.000000	-1.000000	-1.000000	0.000000	-48.000000	...	0.000000	0.000000	-1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-41.000000	-1.000000
25%	2.015101e+09	6.000000	2.000000	1.000000	771.000000	5.000000	20.000000	32.000000	4.000000	5.000000	...	0.000000	0.000000	-1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-7.000000	2.000000
50%	2.015111e+09	12.000000	3.000000	2.000000	1800.000000	16.000000	30.000000	51.000000	10.000000	20.000000	...	0.000000	0.000000	-1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	6.000000
75%	2.015121e+09	18.000000	4.000000	3.000000	2589.000000	37.000000	38.000000	72.000000	10.000000	45.000000	...	0.000000	1.000000	-1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	4.000000	11.000000
max	2.016010e+09	33.000000	5.000000	4.000000	3600.000000	940.000000	50.000000	99.000000	42.000000	99.000000	...	1.000000	1.000000	66.000000	1.000000	1.000000	1.000000	1.000000	55.000000	41.000000	41.000000