Data Loading and Preprocessing

To begin, we load the data from a CSV file into a pandas DataFrame.


In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/data.csv') # read in the csv file


//anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (26) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Let's take a cursory glance at the data to see what we're working with.


In [2]:
df.head()


Out[2]:
Unnamed: 0 Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff ... Accepted.Penalty PenalizedTeam PenaltyType PenalizedPlayer Penalty.Yards PosTeamScore DefTeamScore ScoreDiff AbsScoreDiff Season
0 36 2015-09-10 2015091000 1 1 NaN 15:00 15.0 3600.0 0.0 ... 0 NaN NaN NaN 0 0.0 0.0 0.0 0.0 2015
1 51 2015-09-10 2015091000 1 1 1.0 15:00 15.0 3600.0 0.0 ... 0 NaN NaN NaN 0 0.0 0.0 0.0 0.0 2015
2 72 2015-09-10 2015091000 1 1 1.0 14:21 15.0 3561.0 39.0 ... 0 NaN NaN NaN 0 0.0 0.0 0.0 0.0 2015
3 101 2015-09-10 2015091000 1 1 2.0 14:04 15.0 3544.0 17.0 ... 0 NaN NaN NaN 0 0.0 0.0 0.0 0.0 2015
4 122 2015-09-10 2015091000 1 1 1.0 13:26 14.0 3506.0 38.0 ... 0 NaN NaN NaN 0 0.0 0.0 0.0 0.0 2015

5 rows × 64 columns

There's a lot of data here that we don't need. For example, 'PassAttempt' is a binary attribute, but the 'PlayType' attribute already records 'Pass' for a passing play, so 'PassAttempt' is redundant.

We define a list of the columns we're not interested in, then delete them.


In [3]:
columns_to_delete = ['Unnamed: 0', 'Date', 'time', 'TimeUnder', 
                     'PosTeamScore', 'PassAttempt', 'RushAttempt', 
                     'DefTeamScore', 'Season', 'PlayAttempted']

#Iterate through and delete the columns we don't want
for col in columns_to_delete:
    if col in df:
        del df[col]
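
As an aside, the same result can be achieved in a single call (a quick sketch, not run above): DataFrame.drop with errors='ignore' skips any column that isn't present.


In [ ]:
# Alternative: drop all unwanted columns at once; errors='ignore' skips missing names
df = df.drop(columns_to_delete, axis=1, errors='ignore')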

We can then grab a list of the remaining column names.


In [4]:
df.columns


Out[4]:
Index(['GameID', 'Drive', 'qtr', 'down', 'TimeSecs', 'PlayTimeDiff',
       'SideofField', 'yrdln', 'yrdline100', 'ydstogo', 'ydsnet', 'GoalToGo',
       'FirstDown', 'posteam', 'DefensiveTeam', 'desc', 'Yards.Gained', 'sp',
       'Touchdown', 'ExPointResult', 'TwoPointConv', 'DefTwoPoint', 'Safety',
       'PlayType', 'Passer', 'PassOutcome', 'PassLength', 'PassLocation',
       'InterceptionThrown', 'Interceptor', 'Rusher', 'RunLocation', 'RunGap',
       'Receiver', 'Reception', 'ReturnResult', 'Returner', 'Tackler1',
       'Tackler2', 'FieldGoalResult', 'FieldGoalDistance', 'Fumble',
       'RecFumbTeam', 'RecFumbPlayer', 'Sack', 'Challenge.Replay',
       'ChalReplayResult', 'Accepted.Penalty', 'PenalizedTeam', 'PenaltyType',
       'PenalizedPlayer', 'Penalty.Yards', 'ScoreDiff', 'AbsScoreDiff'],
      dtype='object')

As a temporary, simple fix, we replace all missing values (NaN) with -1 so that columns can later be cast to integers instead of objects.


In [5]:
df = df.replace(to_replace=np.nan,value=-1)
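
As an optional sanity check, we can confirm that no missing values remain after the replacement:


In [ ]:
# Should print 0 if every NaN was replaced with -1
print(df.isnull().sum().sum())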

At this point, many columns are encoded as objects, or with excessively large numeric types.


In [6]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46129 entries, 0 to 46128
Data columns (total 54 columns):
GameID                46129 non-null int64
Drive                 46129 non-null int64
qtr                   46129 non-null int64
down                  46129 non-null float64
TimeSecs              46129 non-null float64
PlayTimeDiff          46129 non-null float64
SideofField           46129 non-null object
yrdln                 46129 non-null float64
yrdline100            46129 non-null float64
ydstogo               46129 non-null int64
ydsnet                46129 non-null int64
GoalToGo              46129 non-null float64
FirstDown             46129 non-null float64
posteam               46129 non-null object
DefensiveTeam         46129 non-null object
desc                  46129 non-null object
Yards.Gained          46129 non-null int64
sp                    46129 non-null int64
Touchdown             46129 non-null int64
ExPointResult         46129 non-null object
TwoPointConv          46129 non-null object
DefTwoPoint           46129 non-null object
Safety                46129 non-null int64
PlayType              46129 non-null object
Passer                46129 non-null object
PassOutcome           46129 non-null object
PassLength            46129 non-null object
PassLocation          46129 non-null object
InterceptionThrown    46129 non-null int64
Interceptor           46129 non-null object
Rusher                46129 non-null object
RunLocation           46129 non-null object
RunGap                46129 non-null object
Receiver              46129 non-null object
Reception             46129 non-null int64
ReturnResult          46129 non-null object
Returner              46129 non-null object
Tackler1              46129 non-null object
Tackler2              46129 non-null object
FieldGoalResult       46129 non-null object
FieldGoalDistance     46129 non-null float64
Fumble                46129 non-null int64
RecFumbTeam           46129 non-null object
RecFumbPlayer         46129 non-null object
Sack                  46129 non-null int64
Challenge.Replay      46129 non-null int64
ChalReplayResult      46129 non-null object
Accepted.Penalty      46129 non-null int64
PenalizedTeam         46129 non-null object
PenaltyType           46129 non-null object
PenalizedPlayer       46129 non-null object
Penalty.Yards         46129 non-null int64
ScoreDiff             46129 non-null float64
AbsScoreDiff          46129 non-null float64
dtypes: float64(10), int64(16), object(28)
memory usage: 19.0+ MB

We define four lists based on the types of features we're using. Binary features are kept separate from the other categorical features so that they can be stored more compactly.


In [7]:
continuous_features = ['TimeSecs', 'PlayTimeDiff', 'yrdln', 'yrdline100',
                       'ydstogo', 'ydsnet', 'Yards.Gained', 'Penalty.Yards',
                       'ScoreDiff', 'AbsScoreDiff']

ordinal_features = ['Drive', 'qtr', 'down']
binary_features = ['GoalToGo', 'FirstDown','sp', 'Touchdown', 'Safety', 'Fumble']
categorical_features = df.columns.difference(continuous_features).difference(ordinal_features).difference(binary_features)

We then cast all of these columns to the appropriate underlying data types.


In [8]:
df[continuous_features] = df[continuous_features].astype(np.float64)
df[ordinal_features] = df[ordinal_features].astype(np.int64)
df[binary_features] = df[binary_features].astype(np.int8)
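
To verify that the narrower dtypes actually help, we can (optionally) check the DataFrame's footprint; memory_usage(deep=True) accounts for the object columns as well.


In [ ]:
# Approximate memory footprint after downcasting, in megabytes
print(df.memory_usage(deep=True).sum() / 1024**2)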

We also re-encode 'PassOutcome' as a binary attribute: 1 for a completed pass and 0 for an incomplete pass.


In [9]:
df['PassOutcome'].replace(['Complete', 'Incomplete Pass'], [1, 0], inplace=True)
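
A quick value count (optional) confirms the mapping took effect; after the earlier NaN replacement, the column should contain only 1 (complete), 0 (incomplete), and -1 (non-pass plays).


In [ ]:
# Expect only the values 1, 0, and -1 after the re-encoding
df['PassOutcome'].value_counts()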

Now the columns are encoded the way we'd like them to be.


In [10]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46129 entries, 0 to 46128
Data columns (total 54 columns):
GameID                46129 non-null int64
Drive                 46129 non-null int64
qtr                   46129 non-null int64
down                  46129 non-null int64
TimeSecs              46129 non-null float64
PlayTimeDiff          46129 non-null float64
SideofField           46129 non-null object
yrdln                 46129 non-null float64
yrdline100            46129 non-null float64
ydstogo               46129 non-null float64
ydsnet                46129 non-null float64
GoalToGo              46129 non-null int8
FirstDown             46129 non-null int8
posteam               46129 non-null object
DefensiveTeam         46129 non-null object
desc                  46129 non-null object
Yards.Gained          46129 non-null float64
sp                    46129 non-null int8
Touchdown             46129 non-null int8
ExPointResult         46129 non-null object
TwoPointConv          46129 non-null object
DefTwoPoint           46129 non-null object
Safety                46129 non-null int8
PlayType              46129 non-null object
Passer                46129 non-null object
PassOutcome           46129 non-null int64
PassLength            46129 non-null object
PassLocation          46129 non-null object
InterceptionThrown    46129 non-null int64
Interceptor           46129 non-null object
Rusher                46129 non-null object
RunLocation           46129 non-null object
RunGap                46129 non-null object
Receiver              46129 non-null object
Reception             46129 non-null int64
ReturnResult          46129 non-null object
Returner              46129 non-null object
Tackler1              46129 non-null object
Tackler2              46129 non-null object
FieldGoalResult       46129 non-null object
FieldGoalDistance     46129 non-null float64
Fumble                46129 non-null int8
RecFumbTeam           46129 non-null object
RecFumbPlayer         46129 non-null object
Sack                  46129 non-null int64
Challenge.Replay      46129 non-null int64
ChalReplayResult      46129 non-null object
Accepted.Penalty      46129 non-null int64
PenalizedTeam         46129 non-null object
PenaltyType           46129 non-null object
PenalizedPlayer       46129 non-null object
Penalty.Yards         46129 non-null float64
ScoreDiff             46129 non-null float64
AbsScoreDiff          46129 non-null float64
dtypes: float64(11), int64(10), int8(6), object(27)
memory usage: 17.2+ MB

Now we can start to take a look at what's in each of our columns.


In [11]:
df.describe()


Out[11]:
GameID Drive qtr down TimeSecs PlayTimeDiff yrdln yrdline100 ydstogo ydsnet ... InterceptionThrown Reception FieldGoalDistance Fumble Sack Challenge.Replay Accepted.Penalty Penalty.Yards ScoreDiff AbsScoreDiff
count 4.612900e+04 46129.000000 46129.000000 46129.000000 46129.000000 46129.000000 46129.000000 46129.000000 46129.000000 46129.000000 ... 46129.000000 46129.000000 46129.000000 46129.000000 46129.000000 46129.000000 46129.000000 46129.000000 46129.000000 46129.000000
mean 2.015164e+09 12.279607 2.583407 1.534176 1685.747989 20.189750 28.495870 49.293286 7.310152 26.672289 ... 0.010145 0.259641 -0.160268 0.013592 0.027401 0.008953 0.076633 0.652388 -1.057578 7.345076
std 2.183164e+05 7.144244 1.134256 1.422973 1065.964474 17.618167 12.657879 24.943136 4.920808 25.379980 ... 0.100214 0.438443 5.871452 0.115792 0.163252 0.094198 0.266011 2.716825 10.554045 7.652127
min 2.015091e+09 1.000000 1.000000 -1.000000 -747.000000 -1.000000 -1.000000 -1.000000 0.000000 -48.000000 ... 0.000000 0.000000 -1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -41.000000 -1.000000
25% 2.015101e+09 6.000000 2.000000 1.000000 758.000000 5.000000 20.000000 31.000000 3.000000 5.000000 ... 0.000000 0.000000 -1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -7.000000 1.000000
50% 2.015111e+09 12.000000 3.000000 2.000000 1800.000000 16.000000 30.000000 50.000000 9.000000 20.000000 ... 0.000000 0.000000 -1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000
75% 2.015121e+09 18.000000 4.000000 3.000000 2579.000000 37.000000 38.000000 71.000000 10.000000 45.000000 ... 0.000000 1.000000 -1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000 11.000000
max 2.016010e+09 33.000000 5.000000 4.000000 3600.000000 940.000000 50.000000 99.000000 42.000000 99.000000 ... 1.000000 1.000000 66.000000 1.000000 1.000000 1.000000 1.000000 55.000000 41.000000 41.000000

8 rows × 27 columns


In [12]:
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
#Embed figures in the Jupyter Notebook
%matplotlib inline

#Use GGPlot style for matplotlib
plt.style.use('ggplot')

In [14]:
# Restrict to passing plays and group them by passer
pass_plays = df[df['PlayType'] == "Pass"]
pass_plays_grouped = pass_plays.groupby(by=['Passer'])

We can take a random sample of passers and show their completion rate:


In [15]:
completion_rate = pass_plays_grouped.PassOutcome.sum() / pass_plays_grouped.PassOutcome.count()
completion_rate_sampled = completion_rate.sample(10)
completion_rate_sampled.sort_values(inplace=True)
completion_rate_sampled.plot(kind='barh')


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1166ea0f0>
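
Note that dividing the sum by the count of a 0/1 column is just its mean, so an equivalent and slightly more idiomatic formulation is:


In [ ]:
# Completion rate as the mean of the binary PassOutcome column
completion_rate = pass_plays_grouped.PassOutcome.mean()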

We can also group by both passer and receiver to check for highly effective QB-receiver combos.


In [16]:
pass_plays_grouped = pass_plays.groupby(by=['Passer', 'Receiver'])
completion_rate = pass_plays_grouped.PassOutcome.sum() / pass_plays_grouped.PassOutcome.count()
completion_rate_sampled = completion_rate.sample(10)
completion_rate_sampled.sort_values(inplace=True)
completion_rate_sampled.plot(kind='barh')


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x116a8ca58>

We can eliminate combos with 10 or fewer pass attempts together, and then re-sample the data. This removes noise from QB-receiver combos whose completion rates are very high or very low simply because they've played together very little.


In [17]:
# Keep only QB-receiver pairs with more than 10 pass attempts together, then re-group
pass_plays_grouped_filtered = pass_plays_grouped.filter(lambda g: len(g) > 10).groupby(by=['Passer', 'Receiver'])
completion_rate = pass_plays_grouped_filtered.PassOutcome.sum() / pass_plays_grouped_filtered.PassOutcome.count()
completion_rate_sampled = completion_rate.sample(10)
completion_rate_sampled.sort_values(inplace=True)
completion_rate_sampled.plot(kind='barh')


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x116b3f2b0>

We can also extract the combos with the highest completion percentages. Here we take the ten most reliable QB-receiver pairs.


In [18]:
completion_rate.sort_values(inplace=True, ascending=False)
completion_rate[0:10].plot(kind='barh')


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x119b2e5c0>

Next, let's find the top ten rushers by yards per carry (considering only rushers with more than 10 carries).


In [19]:
rush_plays = df[(df.Rusher != -1)]
# Keep only rushers with more than 10 carries, then re-group by rusher
rush_plays_grouped = rush_plays.groupby(by=['Rusher']).filter(lambda g: len(g) > 10).groupby(by=["Rusher"])

In [20]:
yards_per_carry = rush_plays_grouped["Yards.Gained"].sum() / rush_plays_grouped["Yards.Gained"].count()
yards_per_carry.sort_values(inplace=True, ascending=False)
yards_per_carry[0:10].plot(kind='barh')


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ae0e3c8>

Now let's take a look at defenses. Which defenses allowed the fewest overall yards?


In [21]:
def_play_groups = df.groupby("DefensiveTeam")

In [22]:
def_yards_allowed = def_play_groups["Yards.Gained"].sum()
def_yards_allowed.sort_values(inplace=True)
def_yards_allowed.plot(kind='barh')


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c15a048>

It looks like the Denver Broncos allowed the fewest yards overall. Go Broncos! Let's see if there were any specific weaknesses in the Broncos' defense.


In [23]:
broncos_def_plays = df[df.DefensiveTeam == "DEN"]
broncos_def_pass_plays = broncos_def_plays[broncos_def_plays.PlayType == "Pass"]
broncos_def_rush_plays = broncos_def_plays[broncos_def_plays.PlayType == "Run"]

print("Passing yards: " + str(broncos_def_pass_plays["Yards.Gained"].sum()))
print("Rushing yards: " + str(broncos_def_rush_plays["Yards.Gained"].sum()))


Passing yards: 3576.0
Rushing yards: 1345.0

It looks like they gave up a lot more passing yards than rushing yards.
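
Raw totals don't account for how often opponents passed versus ran, though; as a quick optional sketch, yards allowed per play gives a complementary view.


In [ ]:
# Average yards allowed per play, by play type, against the Denver defense
print("Yards per pass play: " + str(broncos_def_pass_plays["Yards.Gained"].mean()))
print("Yards per rush play: " + str(broncos_def_rush_plays["Yards.Gained"].mean()))

Now let's see what QBs caused the Denver defense the most trouble.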


In [24]:
pass_plays = df[df.PlayType == "Pass"]
pass_plays_against_den = pass_plays[pass_plays.DefensiveTeam == "DEN"]
pass_plays_against_den_grouped = pass_plays_against_den.groupby("Passer")
qbs_yards_against_den = pass_plays_against_den_grouped["Yards.Gained"].sum()
qbs_yards_against_den.sort_values(inplace=True, ascending=False)
qbs_yards_against_den.plot(kind='barh')


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x119c26f28>

It looks like the Raiders' Derek Carr caused a lot of trouble. From this data we can see that when the Broncos play the Raiders, they need to focus more on pass defense.

To figure out which Oakland WR Denver should put their best cornerback on, let's see which Oakland receiver was the highest-performing against Denver.


In [25]:
oak_pass_plays_against_den = pass_plays_against_den[pass_plays_against_den.posteam == "OAK"]
oak_pass_plays_against_den_grouped = oak_pass_plays_against_den.groupby("Receiver")
oak_receivers_yards_against_den = oak_pass_plays_against_den_grouped["Yards.Gained"].sum()
oak_receivers_yards_against_den.sort_values(inplace=True, ascending=False)
oak_receivers_yards_against_den.plot(kind='barh')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x1219d16d8>

M. Rivera and M. Crabtree were the dominant receivers. Going into the game against Oakland this year, we can expect Denver to put their best cornerbacks on them.

But who are their best cornerbacks? One way to judge is to rank Denver's defenders by interception count.


In [26]:
denver_interceptions = pass_plays_against_den[pass_plays_against_den.InterceptionThrown == 1]
denver_interceptions_grouped = denver_interceptions.groupby("Interceptor")
denver_cornerback_interceptions = denver_interceptions_grouped.Interceptor.count()
denver_cornerback_interceptions_sorted = denver_cornerback_interceptions.sort_values(ascending=False)

denver_cornerback_interceptions_sorted.plot(kind='barh')


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x121b235f8>

The Broncos' top choices to cover M. Rivera and M. Crabtree, therefore, are A. Talib and D. Trevathan.


In [27]:
# Print a Markdown stub for documenting each remaining column
# ('Description' is a placeholder to be filled in later)
for x in df.columns:
    print("* **" + str(x) + "**: Description")


* **GameID**: Description
* **Drive**: Description
* **qtr**: Description
* **down**: Description
* **TimeSecs**: Description
* **PlayTimeDiff**: Description
* **SideofField**: Description
* **yrdln**: Description
* **yrdline100**: Description
* **ydstogo**: Description
* **ydsnet**: Description
* **GoalToGo**: Description
* **FirstDown**: Description
* **posteam**: Description
* **DefensiveTeam**: Description
* **desc**: Description
* **Yards.Gained**: Description
* **sp**: Description
* **Touchdown**: Description
* **ExPointResult**: Description
* **TwoPointConv**: Description
* **DefTwoPoint**: Description
* **Safety**: Description
* **PlayType**: Description
* **Passer**: Description
* **PassOutcome**: Description
* **PassLength**: Description
* **PassLocation**: Description
* **InterceptionThrown**: Description
* **Interceptor**: Description
* **Rusher**: Description
* **RunLocation**: Description
* **RunGap**: Description
* **Receiver**: Description
* **Reception**: Description
* **ReturnResult**: Description
* **Returner**: Description
* **Tackler1**: Description
* **Tackler2**: Description
* **FieldGoalResult**: Description
* **FieldGoalDistance**: Description
* **Fumble**: Description
* **RecFumbTeam**: Description
* **RecFumbPlayer**: Description
* **Sack**: Description
* **Challenge.Replay**: Description
* **ChalReplayResult**: Description
* **Accepted.Penalty**: Description
* **PenalizedTeam**: Description
* **PenaltyType**: Description
* **PenalizedPlayer**: Description
* **Penalty.Yards**: Description
* **ScoreDiff**: Description
* **AbsScoreDiff**: Description

In [28]:
# How many plays took place on each side of the field?
df.SideofField.value_counts()


Out[28]:
PHI    1613
SF     1545
OAK    1544
NO     1528
JAC    1524
HOU    1504
NYG    1503
SD     1498
MIA    1497
IND    1495
TB     1481
BUF    1478
DET    1468
DAL    1452
STL    1440
CLE    1436
PIT    1432
GB     1415
DEN    1413
BAL    1400
TEN    1393
NE     1382
CHI    1360
NYJ    1348
ATL    1344
WAS    1329
MIN    1316
CIN    1307
ARI    1292
SEA    1281
CAR    1272
KC     1253
MID     478
-1       66
50       42
Name: SideofField, dtype: int64

In [29]:
df


Out[29]:
GameID Drive qtr down TimeSecs PlayTimeDiff SideofField yrdln yrdline100 ydstogo ... Sack Challenge.Replay ChalReplayResult Accepted.Penalty PenalizedTeam PenaltyType PenalizedPlayer Penalty.Yards ScoreDiff AbsScoreDiff
0 2015091000 1 1 -1 3600.0 0.0 NE 35.0 35.0 0.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
1 2015091000 1 1 1 3600.0 0.0 PIT 20.0 80.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
2 2015091000 1 1 1 3561.0 39.0 PIT 38.0 62.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
3 2015091000 1 1 2 3544.0 17.0 PIT 47.0 53.0 1.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
4 2015091000 1 1 1 3506.0 38.0 NE 49.0 49.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
5 2015091000 1 1 1 3462.0 44.0 NE 35.0 35.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
6 2015091000 1 1 1 3425.0 37.0 NE 24.0 24.0 10.0 ... 1 0 -1 0 -1 -1 -1 0.0 0.0 0.0
7 2015091000 1 1 2 3380.0 45.0 NE 32.0 32.0 18.0 ... 0 0 -1 1 PIT -1 M.Gilbert 10.0 0.0 0.0
8 2015091000 1 1 2 3353.0 27.0 NE 42.0 42.0 28.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
9 2015091000 1 1 3 3328.0 25.0 NE 36.0 36.0 22.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
10 2015091000 1 1 4 3284.0 44.0 NE 26.0 26.0 12.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
11 2015091000 2 1 1 3280.0 4.0 NE 34.0 66.0 10.0 ... 0 0 -1 1 NE -1 N.Solder 15.0 0.0 0.0
12 2015091000 2 1 1 3254.0 26.0 NE 32.0 68.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
13 2015091000 2 1 1 3240.0 14.0 NE 44.0 56.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
14 2015091000 2 1 1 3211.0 29.0 PIT 43.0 43.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
15 2015091000 2 1 2 3207.0 4.0 PIT 43.0 43.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
16 2015091000 2 1 3 3202.0 5.0 PIT 43.0 43.0 10.0 ... 1 0 -1 0 -1 -1 -1 0.0 0.0 0.0
17 2015091000 2 1 4 3168.0 34.0 PIT 43.0 43.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
18 2015091000 3 1 1 3161.0 7.0 PIT 7.0 93.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
19 2015091000 3 1 2 3127.0 34.0 PIT 13.0 87.0 4.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
20 2015091000 3 1 3 3086.0 41.0 PIT 12.0 88.0 5.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
21 2015091000 3 1 1 3054.0 32.0 PIT 22.0 78.0 10.0 ... 0 0 -1 1 PIT Illegal Formation K.Beachum 5.0 0.0 0.0
22 2015091000 3 1 1 3029.0 25.0 PIT 17.0 83.0 15.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
23 2015091000 3 1 2 2988.0 41.0 PIT 20.0 80.0 12.0 ... 1 0 -1 0 -1 -1 -1 0.0 0.0 0.0
24 2015091000 3 1 3 2943.0 45.0 PIT 14.0 86.0 18.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
25 2015091000 3 1 4 2905.0 38.0 PIT 31.0 69.0 1.0 ... 0 0 -1 1 NE -1 M.Slater 10.0 0.0 0.0
26 2015091000 4 1 1 2894.0 11.0 NE 10.0 90.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
27 2015091000 4 1 2 2860.0 34.0 NE 18.0 82.0 2.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
28 2015091000 4 1 3 2825.0 35.0 NE 19.0 81.0 1.0 ... 0 0 -1 0 -1 -1 -1 0.0 0.0 0.0
29 2015091000 4 1 1 2774.0 51.0 NE 20.0 80.0 10.0 ... 0 0 -1 1 NE -1 N.Solder 10.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
46099 2016010310 18 4 1 482.0 39.0 MIN 36.0 36.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -9.0 9.0
46100 2016010310 18 4 1 450.0 32.0 MIN 25.0 25.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -9.0 9.0
46101 2016010310 18 4 2 415.0 35.0 MIN 22.0 22.0 7.0 ... 0 0 -1 0 -1 -1 -1 0.0 -9.0 9.0
46102 2016010310 18 4 3 382.0 33.0 MIN 18.0 18.0 3.0 ... 1 0 -1 0 -1 -1 -1 0.0 -9.0 9.0
46103 2016010310 18 4 4 339.0 43.0 MIN 25.0 25.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46104 2016010310 19 4 -1 335.0 4.0 GB 35.0 35.0 0.0 ... 0 0 -1 0 -1 -1 -1 0.0 6.0 6.0
46105 2016010310 20 4 1 321.0 14.0 GB 23.0 77.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46106 2016010310 20 4 1 277.0 44.0 GB 40.0 60.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46107 2016010310 20 4 1 242.0 35.0 MIN 23.0 23.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46108 2016010310 20 4 2 236.0 6.0 MIN 23.0 23.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46109 2016010310 20 4 1 197.0 39.0 MIN 10.0 10.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46110 2016010310 20 4 2 190.0 7.0 MIN 10.0 10.0 10.0 ... 1 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46111 2016010310 20 4 3 147.0 43.0 MIN 13.0 13.0 13.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46112 2016010310 20 4 -1 138.0 9.0 MIN 13.0 13.0 0.0 ... 0 0 -1 0 -1 -1 -1 0.0 -1.0 -1.0
46113 2016010310 20 4 4 138.0 0.0 MIN 13.0 13.0 13.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46114 2016010310 21 4 1 129.0 9.0 MIN 20.0 80.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 6.0 6.0
46115 2016010310 21 4 -1 120.0 9.0 MIN 20.0 20.0 0.0 ... 0 0 -1 0 -1 -1 -1 0.0 -1.0 -1.0
46116 2016010310 21 4 2 120.0 0.0 MIN 20.0 80.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 6.0 6.0
46117 2016010310 21 4 -1 113.0 7.0 MIN 20.0 20.0 0.0 ... 0 0 -1 0 -1 -1 -1 0.0 -1.0 -1.0
46118 2016010310 21 4 3 113.0 0.0 MIN 25.0 75.0 5.0 ... 0 0 -1 0 -1 -1 -1 0.0 6.0 6.0
46119 2016010310 21 4 -1 68.0 45.0 MIN 25.0 25.0 0.0 ... 0 0 -1 0 -1 -1 -1 0.0 -1.0 -1.0
46120 2016010310 21 4 4 67.0 1.0 MIN 27.0 73.0 3.0 ... 0 0 -1 0 -1 -1 -1 0.0 6.0 6.0
46121 2016010310 22 4 1 58.0 9.0 GB 42.0 58.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46122 2016010310 22 4 2 54.0 4.0 GB 42.0 58.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46123 2016010310 22 4 3 50.0 4.0 GB 42.0 58.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46124 2016010310 22 4 4 43.0 7.0 GB 42.0 58.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46125 2016010310 22 4 1 27.0 16.0 MIN 46.0 46.0 10.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46126 2016010310 22 4 2 24.0 3.0 MIN 39.0 39.0 3.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46127 2016010310 22 4 3 15.0 9.0 MIN 39.0 39.0 3.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0
46128 2016010310 22 4 4 1.0 14.0 MIN 38.0 38.0 2.0 ... 0 0 -1 0 -1 -1 -1 0.0 -6.0 6.0

46129 rows × 54 columns


In [13]:
import seaborn as sns
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings

In [16]:
df.columns


Out[16]:
Index(['GameID', 'Drive', 'qtr', 'down', 'TimeSecs', 'PlayTimeDiff',
       'SideofField', 'yrdln', 'yrdline100', 'ydstogo', 'ydsnet', 'GoalToGo',
       'FirstDown', 'posteam', 'DefensiveTeam', 'desc', 'Yards.Gained', 'sp',
       'Touchdown', 'ExPointResult', 'TwoPointConv', 'DefTwoPoint', 'Safety',
       'PlayType', 'Passer', 'PassOutcome', 'PassLength', 'PassLocation',
       'InterceptionThrown', 'Interceptor', 'Rusher', 'RunLocation', 'RunGap',
       'Receiver', 'Reception', 'ReturnResult', 'Returner', 'Tackler1',
       'Tackler2', 'FieldGoalResult', 'FieldGoalDistance', 'Fumble',
       'RecFumbTeam', 'RecFumbPlayer', 'Sack', 'Challenge.Replay',
       'ChalReplayResult', 'Accepted.Penalty', 'PenalizedTeam', 'PenaltyType',
       'PenalizedPlayer', 'Penalty.Yards', 'ScoreDiff', 'AbsScoreDiff'],
      dtype='object')

In [24]:
# Pair plot of a few numeric columns for Dallas offensive plays
df_plot = df[["Yards.Gained", "yrdline100", "TimeSecs", "ydsnet", "ScoreDiff", "posteam"]]
df_plot = df_plot[df_plot.posteam == "DAL"]
#sns.pairplot(df_plot, hue="posteam", size=2)
sns.pairplot(df_plot, size=2)


Out[24]:
<seaborn.axisgrid.PairGrid at 0x138063668>

In [37]:
#sns.violinplot(x="posteam", y="Yards.Gained", hue="PlayType", data=df, inner="quart")
# Compare yards gained on Denver's offensive pass plays vs. run plays
den_plays = df[df.posteam == "DEN"]
den_passes = den_plays[den_plays.PlayType == "Pass"]
den_runs = den_plays[den_plays.PlayType == "Run"]
den_plays = pd.concat([den_passes, den_runs])
sns.violinplot(x="PlayType", y="Yards.Gained", data=den_plays, inner="quart")


Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x13b650c18>

In [38]:
import plotly
print (plotly.__version__)            # version 1.9.x required
plotly.offline.init_notebook_mode() # run at the start of every notebook


1.12.9

In [50]:
from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis

plotly.offline.iplot({
    'data':[
        Scatter(x=den_plays.PlayType,
                y=den_plays["Yards.Gained"],
                text=den_plays.PlayType.values.astype(str),
                marker=Marker(size=den_plays.ydstogo, sizemode='area', sizeref=1,),
                mode='markers')
            ],
    'layout': Layout(xaxis=XAxis(title='Play Type'), yaxis=YAxis(title='Yards Gained'))
}, show_link=False)



In [ ]: