In [2]:
# Silence all warnings for a cleaner presentation.
# NOTE(review): a blanket filter also hides deprecation warnings — consider
# narrowing to specific categories.
import warnings
warnings.filterwarnings('ignore')
In [3]:
# %pylab star-imports numpy and matplotlib into the namespace; later cells
# rely on the injected names (`sort`, `isnan`, `pylab`, `plt`, `np`).
# NOTE(review): %pylab is discouraged in favor of explicit imports plus
# %matplotlib inline, but removing it would break those bare names below.
%pylab inline
import pandas as pd
import seaborn as sns
In [4]:
# Load the raw fantasy-football data; also keep a subset with week 1 removed
# (a later cell notes week 1 has no previous-weeks' mean features).
# NOTE(review): hardcoded absolute path — breaks on any other machine;
# consider a configurable data directory.
raw_df = pd.read_csv("/home/brianb/Downloads/odsc_football_modeling_data_2.csv")
df_no_week_1 = raw_df[raw_df.week > 1]
In [5]:
# List the available feature columns in alphabetical order.
ff_cols = raw_df.columns
# Fix: qualify as np.sort — the bare `sort` only exists because %pylab
# star-imported numpy; the explicit form survives removing %pylab.
np.sort(ff_cols.values)
Out[5]:
In [6]:
# Peek at the first rows of every feature column.
raw_df[ff_cols].head()
Out[6]:
In [7]:
# Same preview, but with week 1 excluded.
df_no_week_1[ff_cols].head()
Out[7]:
Pandas is a Python library you may have heard of that is great for exploring data interactively. Let's look at what we already did with Pandas and what else we can do!
In [9]:
raw_df = pd.read_csv("/home/brianb/Downloads/odsc_football_modeling_data_2.csv")
It reads a CSV file into a DataFrame.
DataFrames are one of the most widely used data structures in data science.
Popularized by R, they provide a standardized matrix-style format for interacting with your data. Most data can fit into this row and column format: financial transactions, iPhone app user records, medical histories, etc.
In [10]:
# Select a single column (returns a Series).
raw_df['full_name'].head()
Out[10]:
In [11]:
# Select multiple columns with a list of names (returns a DataFrame).
raw_df[['full_name', 'position', 'team']].head()
Out[11]:
In [13]:
# Boolean-mask filtering: keep only rows after week 1.
raw_df[raw_df.week > 1].head()
Out[13]:
We can just output the entire dataframe to the console, but that doesn't scale beyond a couple hundred rows.
In [1]: df = DataFrame(np.random.randn(10, 4))
In [2]: df
Out[2]:
0 1 2 3
0 0.469112 -0.282863 -1.509059 -1.135632
1 1.212112 -0.173215 0.119209 -1.044236
2 -0.861849 -2.104569 -0.494929 1.071804
3 0.721555 -0.706771 -1.039575 0.271860
4 -0.424972 0.567020 0.276232 -1.087401
5 -0.673690 0.113648 -1.478427 0.524988
6 0.404705 0.577046 -1.715002 -1.039268
7 -0.370647 -1.157892 -1.344312 0.844885
8 1.075770 -0.109050 1.643563 -1.469388
9 0.357021 -0.674600 -1.776904 -0.968914
In [16]:
# Re-derive the no-week-1 subset and count missing values per column.
df_no_week_1 = raw_df[raw_df.week > 1]
df_no_week_1.isnull().sum()
Out[16]:
We will mainly use seaborn examples in this presentation. It's very intuitive and powerful to use.
In [10]:
# Histogram of the raw target (fanduel_points).
# Fix: create the sized figure BEFORE plotting — the original called
# pylab.figure(figsize=...) last, which opened a new empty figure and left
# the histogram on a default-sized one.
pylab.figure(figsize=(15,15))
pylab.hist(raw_df['fanduel_points'],
           normed=True,
           bins=np.linspace(-1, 35, 12),
           alpha=0.35,
           label='fanduel_points')
pylab.legend()
Out[10]:
In [11]:
# Log-transform the target to reduce right skew, then plot its histogram.
# Fix: the original filtered the copy with `> 1` but computed the log from a
# separate `> 0` slice of raw_df, which evaluates log(points - 1) on
# non-positive inputs (-inf/NaN) before index alignment discards them.
# Computing from the already-filtered copy gives the same kept values
# without the invalid log inputs.
transformed_target = pd.DataFrame.copy(raw_df[raw_df.fanduel_points > 1])
transformed_target['fanduel_points'] = np.log(transformed_target['fanduel_points'] - 1)
pylab.figure(figsize=(15,15))
pylab.hist(transformed_target['fanduel_points'],
           normed=True,
           bins=np.linspace(-1, 4, 100),
           alpha=0.35,
           label='fanduel_points')
pylab.legend()
Out[11]:
In [13]:
no_nans = raw_df[raw_df[raw_df.fanduel_points< 3].notnull()]
no_nans = raw_df[raw_df.position != 'UNK'] #remove Unknowns from this dataframe
no_nans.groupby('position').size()
bad_performances = pd.DataFrame({'count' : no_nans.groupby('position').size()}).reset_index()
bad_performances = bad_performances.sort(['count'], ascending=[0])
print bad_performances
g = sns.factorplot("position", "count",
data=bad_performances, kind="bar",
size=15, palette="pastel", dropna=True, x_order=bad_performances.position.values)
In [14]:
#Let's look at an individual player
# Combine row conditions with & (bitwise AND); each clause needs parentheses.
raw_df[(raw_df.full_name =='Tom Brady') & (raw_df.week == 2)]
Out[14]:
In [15]:
#All positions are not created equal -- some score many more points than others
# Fix: the original indexed raw_df with a boolean *DataFrame*
# (raw_df[raw_df.fanduel_points > 1].notnull()), which NaN-masks every cell
# outside the subset before grouping. A plain row filter states the intent
# directly and avoids the accidental masking.
raw_df[raw_df.fanduel_points > 1].groupby('position')['fanduel_points'].sum()
Out[15]:
In [19]:
#Since a few positions seem to score all the points, let's zoom in on those
plot_order = ['TE', 'WR', 'RB',
              'K', 'QB']
# Keep only the high-scoring positions; show mean points per position.
top_positions_only = raw_df[raw_df.position.isin(plot_order)]
top_positions_only.groupby('position')['fanduel_points'].mean()
Out[19]:
In [20]:
# Violin plots are a nice alternative to boxplots that also show interesting detail about
# the shape of the distribution
# Boolean mask of rows with a non-null target, used to index both series.
nonnull_subset = top_positions_only['fanduel_points'].notnull()
plt.figure(figsize=(12, 6))
# NOTE(review): positional (values, groups) violinplot arguments are the old
# seaborn API; newer versions expect x=/y= keywords.
sns.violinplot(top_positions_only['fanduel_points'][nonnull_subset],
               top_positions_only['position'][nonnull_subset],
               inner='box',
               order=plot_order,
               bw=1,
               size=16)
Out[20]:
In [18]:
# QB's and kickers score the most points. Let's look into those using a Histogram
qb_k = ['K', 'QB']
qb_k_data = raw_df[raw_df.position.isin(qb_k)]
# groupby(...).groups maps each position label to its row index labels.
groups = qb_k_data.groupby('position').groups
pylab.figure(figsize=(15,5))
# Overlay one translucent histogram per position on shared bins.
# NOTE: dict.iteritems() is Python 2 (matches the `print x` statements
# elsewhere in this notebook).
for key, row_ids in groups.iteritems():
    pylab.hist(qb_k_data['fanduel_points'][row_ids].values,
               normed=True,
               bins=np.linspace(-10, 50, 50),
               alpha=0.35,
               label=str(key))
pylab.legend()
Out[18]:
In [19]:
# We can use a Facetgrid to analyze teams
def vertical_p95_line(x, **kwargs):
    """Draw a vertical line at the 95th percentile of x.

    Fix: renamed from vertical_mean_line — the function has always plotted
    np.percentile(x, 95), not the mean, and the old name was misleading.
    Extra kwargs are forwarded to plt.axvline.
    """
    plt.axvline(np.percentile(x, 95), **kwargs)

teams = ['NE', 'CHI']
# Restrict to two teams, weeks 1-3, and facet histograms by team x week.
team_data = raw_df[raw_df.team.isin(teams)]
team_data = team_data[team_data.week < 4]
g = sns.FacetGrid(team_data, row="team", col="week",
                  margin_titles=True, dropna=True, size=4)
bins = np.linspace(-3, 30, 30)
g.map(plt.hist, "fanduel_points", color="black", bins=bins,
      lw=0, normed=True)
g.map(vertical_p95_line, 'fanduel_points')
Out[19]:
In [20]:
# We can use a Heatmap to analyze teams kicker and qb performance
teams = ['NE', 'CHI', 'NYG', 'DET', 'NYJ']
# Restrict the QB/kicker frame to these teams, weeks 1-9.
team_data = qb_k_data[qb_k_data.team.isin(teams)]
team_data = team_data[team_data.week < 10]
# Pivot to team rows x week columns of aggregated fanduel_points.
ptable = pd.pivot_table(
    team_data,
    values='fanduel_points',
    index=["team"],
    columns='week')
# Keep teams in the order listed above; weeks with no data become 0.
reorder_teams = ptable.reindex(teams).fillna(0)
pylab.figure(figsize=(15,5))
sns.heatmap(reorder_teams.astype(int), annot=True, fmt="d", cmap="YlGnBu")
# Zero values are bye weeks
Out[20]:
In [21]:
# Are previous week's points a good predictor of current week's points?
# Let's consider only kicker and QB data for these teams
# We have to exclude week 1 here since there is no previous weeks' mean
team_data_no_week_1 = team_data[team_data.week > 1]
# Scatter of prior-weeks' mean vs. actual points, with rug plots on margins.
grid = sns.JointGrid(team_data_no_week_1['mean_fanduel_points'],
                     team_data_no_week_1['fanduel_points'], space=0, size=10, ratio=50)
grid.plot_joint(plt.scatter, color="g")
grid.plot_marginals(sns.rugplot, height=1, color="g")
Out[21]:
In [22]:
# We can use jointplot (uses JointGrid internally) to get a quick regression line for this
sns.jointplot('mean_fanduel_points', 'fanduel_points', data=team_data_no_week_1,
              kind="reg", color=sns.color_palette()[1], size=9)
Out[22]:
In [23]:
# QB's are significantly more important than any other position. Let's dig in
qb_df = raw_df[raw_df.position == 'QB']
In [24]:
# Passing attempts from previous weeks --- is there a trend with next week's performance? No
sns.jointplot(qb_df['mean_passing_att'],
              qb_df['fanduel_points'], kind="reg", size=9)
Out[24]:
In [25]:
import sklearn
In [21]:
# Let's prep for modeling
# Drop week 1 (no prior-week features) and non-positive scores.
exclude_week_1 = top_positions_only[top_positions_only.week > 1]
model_data = pd.DataFrame.copy(exclude_week_1)
model_data = model_data[model_data.fanduel_points > 0]
# Sanity check: no NaNs left in the target column.
print np.isnan(model_data['fanduel_points']).sum()
# Let's cut our target out so we don't train on it
target = model_data.pop('fanduel_points')
# We don't need player id's --- let's throw this away
throw_away = model_data.pop('player_id')
# NOTE(review): sklearn.cross_validation is the pre-0.18 module name
# (model_selection in later sklearn versions).
import sklearn.cross_validation
# 80/20 train/test split with a fixed seed for reproducibility.
(train_data,
 test_data,
 train_target,
 test_target) = sklearn.cross_validation.train_test_split(
    model_data, target, test_size=0.2, random_state=1337
)
In [27]:
#Handle categorical vars
import sklearn.preprocessing
import sklearn.feature_extraction
from sklearn.feature_extraction import DictVectorizer

#Let's do one-hot encoding in sklearn using DictVectorizer
encoder = DictVectorizer(sparse=False)
categorical_vars = ['full_name', 'position', 'team', 'week', 'opponent', 'home_team', 'away_team']
# Fill missing values with a sentinel so they get their own indicator column.
vardata = train_data[categorical_vars].fillna('MISSING')
# Fit and transform the training records in one pass (the original fit,
# then re-transformed the same records).
train_catdata = encoder.fit_transform(vardata.to_dict(orient='records'))
test_vardata = test_data[categorical_vars].fillna('MISSING')
# Fix: dropped the redundant second [categorical_vars] selection —
# test_vardata is already restricted to those columns.
test_catdata = encoder.transform(test_vardata.to_dict(orient='records'))
pd.DataFrame(train_catdata).describe()
Out[27]:
In [28]:
#Handle numeric vars
# NOTE(review): Imputer is the pre-0.20 sklearn class (SimpleImputer later).
from sklearn.preprocessing import Imputer
imputer = Imputer(strategy='median')
# Fix: list(set(...)) yields a nondeterministic column order, silently
# reordering the feature matrix between runs. Preserve the DataFrame's own
# column order instead.
numeric_vars = [c for c in train_data.columns if c not in categorical_vars]
numdata = train_data[numeric_vars]
# Learn per-column medians on train, then fill NaNs in both splits.
imputer.fit(numdata)
train_numdata = imputer.transform(numdata)
test_numdata = imputer.transform(test_data[numeric_vars])
In [29]:
# Full feature matrices: imputed numerics + one-hot categoricals.
train_this = np.hstack([train_numdata, train_catdata])
test_this = np.hstack([test_numdata, test_catdata])
In [30]:
import sklearn
from sklearn.linear_model import LinearRegression
print np.any(isnan(train_numdata))
print np.all(np.isfinite(train_numdata))
lr = LinearRegression(fit_intercept=False)
lr.fit(train_numdata, train_target)
lr_predictions = pd.Series(lr.predict(test_numdata),
name='Linear Regression')
In [31]:
# Hex-bin joint plot of predicted vs. actual points for the linear model.
p_df = pd.DataFrame({'Prediction': lr_predictions,
                     'Actual': test_target.values})
pylab.figure(figsize=(10, 10))
sns.jointplot('Actual', 'Prediction', data=p_df,
              kind="hex", color=sns.color_palette()[1])
# Let's take a look at how predictions track actuals; note the model above
# was fit on just the numeric vars (not the categoricals).
Out[31]:
In [32]:
from sklearn import metrics

# Display name -> sklearn regression metric function.
test_metrics = {
    'Explained Variance': metrics.explained_variance_score,
    'MAE': metrics.mean_absolute_error,
    'MSE': metrics.mean_squared_error,
    'MedAE': metrics.median_absolute_error,
    'R2': metrics.r2_score
}

def metrics_report(*predictions):
    """Score each named prediction Series against test_target.

    Returns a DataFrame with one row per prediction set (indexed by the
    Series' .name) and one column per metric in test_metrics.
    """
    rows = []
    for preds in predictions:
        row = {'name': preds.name}
        row.update(
            (metric_name, test_metrics[metric_name](test_target, preds))
            for metric_name in sorted(test_metrics)
        )
        rows.append(row)
    return pd.DataFrame.from_records(rows).set_index('name')

metrics_report(lr_predictions)
Out[32]:
In [33]:
# We need to add reference models to track a baseline performance that we can compare our other models to
# Constant baseline: always predict the training-set mean.
mean_response = np.mean(train_target)
mean_predictions = pd.Series(np.ones_like(test_target) * mean_response,
                             name='Mean Response')
# Constant baseline: always predict the training-set median.
median_response = np.median(train_target)
median_predictions = pd.Series(np.ones_like(test_target) * median_response,
                               name='Median Response')
metrics_report(mean_predictions,
               median_predictions,
               lr_predictions)
Out[33]:
In [34]:
#Time for ElasticNet
# Grid-search ElasticNet over regularization strength (alpha) and L1/L2
# mix (l1_ratio).
# NOTE(review): sklearn.grid_search is the pre-0.18 module name
# (model_selection in later versions).
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import ElasticNet
estimator = ElasticNet()
parameters = {
    'alpha': np.linspace(0.1, 2, 10, endpoint=True),
    'l1_ratio': np.linspace(0, 1, 10, endpoint=True)
}
enet = GridSearchCV(estimator, parameters)
enet.fit(train_numdata, train_target)
Out[34]:
In [35]:
# Report best hyperparameters and the per-combination CV scores.
# NOTE(review): the rest of this notebook uses Python 2 print statements;
# without `from __future__ import print_function`, print() emits "()" and
# print(a, b) prints a tuple — confirm which interpreter this ran under.
print(enet.best_params_, enet.best_score_)
print()
print("Grid scores on development set:")
print()
for params, mean_score, scores in enet.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() * 2, params))
print()
In [36]:
# Refine the search on a narrower grid around alpha/l1_ratio in [0.4, 0.6].
estimator2 = ElasticNet()
parameters2 = {
    'alpha': np.linspace(0.4, 0.6, 10, endpoint=True),
    'l1_ratio': np.linspace(0.4, 0.6, 10, endpoint=True)
}
enet2 = GridSearchCV(estimator2, parameters2)
enet2.fit(train_numdata, train_target)
Out[36]:
In [46]:
# Same report as above, for the refined grid.
# NOTE(review): this duplicates the earlier reporting cell — a small helper
# function would avoid the copy/paste.
print(enet2.best_params_, enet2.best_score_)
print()
print("Grid scores on development set:")
print()
for params, mean_score, scores in enet2.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() * 2, params))
print()
In [47]:
# Predict with the coarse grid search and plot predicted vs. actual.
# NOTE(review): this uses `enet` (coarse grid), not the refined `enet2` —
# confirm which was intended.
enet_predictions = pd.Series(enet.predict(test_numdata),
                             name='Elastic Net')
p_df = pd.DataFrame({'Enet Prediction': enet_predictions,
                     'Actual': test_target.values})
pylab.figure(figsize=(10, 10))
sns.jointplot('Actual', 'Enet Prediction', data=p_df, kind="hex",
              color=sns.color_palette()[2])
Out[47]:
In [239]:
# Grid-search a random forest over tree count and depth. Unlike the linear
# models, this fits on the full matrix (numeric + one-hot categoricals).
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
estimator = RandomForestRegressor()
parameters = {'n_estimators': (5, 10, 15, 20, 25, 30, 35),
              'max_depth': (3, 5, 7, 9, 11),
              }
# n_jobs=3: run the CV fits in parallel.
# NOTE(review): no random_state on the forest, so results vary across runs.
rfr = GridSearchCV(estimator, parameters, n_jobs=3)
rfr.fit(train_this, train_target)
rfr_predictions = pd.Series(rfr.predict(test_this),
                            name='Random Forest')
p_df = pd.DataFrame({'RF Prediction': rfr_predictions,
                     'Actual': test_target.values})
pylab.figure(figsize=(10, 10))
sns.jointplot('Actual', 'RF Prediction', data=p_df, kind="hex",
              color=sns.color_palette()[3])
Out[239]:
In [240]:
# Compare all models fitted so far against the constant baselines.
metrics_report(mean_predictions,
               median_predictions,
               lr_predictions,
               enet_predictions,
               rfr_predictions)
Out[240]:
In [241]:
# Per-model prediction errors.
# NOTE(review): lr_predictions has a default 0..n-1 index while test_target
# keeps its original row labels (cf. the Polynomial Lasso cell, which sets
# index=test_target.index explicitly) — verify these subtractions align.
lr_diffs = lr_predictions - test_target
lr_diffs.name = 'LinearRegression Error'
rfr_diffs = rfr_predictions - test_target
rfr_diffs.name = 'RandomForest Error'
# NOTE(review): lr_diffs/rfr_diffs are computed but unused — the plot below
# compares the raw predictions; was the intent to plot the error series?
sns.jointplot(lr_predictions, rfr_predictions, kind='resid', color=sns.color_palette()[4])
Out[241]:
In [242]:
# Quadratic polynomial features + L1-regularized regression, as a pipeline.
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
degree = 2
model = make_pipeline(PolynomialFeatures(degree), Lasso())
model.fit(train_numdata, train_target)
# Align the prediction Series to the test-target index so arithmetic and
# plotting match rows correctly.
poly_preds = pd.Series(model.predict(test_numdata),
                       name='Polynomial Lasso',
                       index=test_target.index)
sns.jointplot(test_target,
              poly_preds,
              kind='resid',
              color=sns.color_palette()[5])
Out[242]:
In [243]:
# Final scoreboard: every model against the constant baselines.
metrics_report(mean_predictions,
               median_predictions,
               lr_predictions,
               enet_predictions,
               rfr_predictions,
               poly_preds)
Out[243]:
In [ ]: