Huzzah. Our new competition predicting fog water collection from weather data is up! Let's put on our DATA SCIENCE HATS and get to work.
Here's what we're doing right now: treating the microclimate measurements as independent snapshots in time and using them as predictors of the outcome variable.
Here's what we're NOT doing right now:
Here's a new term for y'all courtesy of the DrivenData team:
Somebody throw it on Urban Dictionary — let's make fetch happen. Okay, time to load up the tools of the trade.
In [1]:
%matplotlib inline
from __future__ import print_function
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
For this benchmark, we are only going to use microclimate data.
In [2]:
microclimate_train = pd.read_csv('data/eaa4fe4a-b85f-4088-85ee-42cabad25c81.csv',
                                 index_col=0, parse_dates=[0])
microclimate_test = pd.read_csv('data/fb38df29-e4b7-4331-862c-869fac984cfa.csv',
                                index_col=0, parse_dates=[0])
labels = pd.read_csv('data/a0f785bc-e8c7-4253-8e8a-8a1cd0441f73.csv',
                     index_col=0, parse_dates=[0])
submission_format = pd.read_csv('data/submission_format.csv',
                                index_col=0, parse_dates=[0])
First of all, let's just "walk around" this data a bit and use the common summary functions to get a better idea of what it looks like.
In [3]:
microclimate_train.shape
Out[3]:
In [4]:
microclimate_train.describe()
Out[4]:
In [31]:
microclimate_train.tail()
Out[31]:
In [6]:
microclimate_train.isnull().sum(axis=0)
Out[6]:
Here's the big takeaway: we will have to deal with some missing values.
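(If you're curious how one might start patching those gaps, here's a minimal sketch using time-based interpolation. It's purely illustrative; the pipeline we build below handles imputation for us.)

# Illustrative only: interpolate short gaps using the datetime index.
# The benchmark itself relies on the Imputer step in the pipeline further down.
filled = microclimate_train.interpolate(method='time', limit=6)
filled.isnull().sum(axis=0)  # see how many missing values remain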
In [7]:
microclimate_test.shape
Out[7]:
In [8]:
microclimate_test.describe()
Out[8]:
In [10]:
microclimate_test.isnull().sum(axis=0)
Out[10]:
So again, we will definitely need to pay attention to missing values and figure out smart ways to deal with that. File under: #datascientistproblems.
Let's also plot all the data points in the training and test data to get a sense for how the data set is split:
In [11]:
fig, axs = plt.subplots(nrows=microclimate_train.shape[1], ncols=1, sharex=True, figsize=(16, 18))
columns = microclimate_train.columns

for i, ax in enumerate(axs):
    col = columns[i]
    ax.plot_date(microclimate_train.index, microclimate_train[col], ms=1.5, label='train')
    ax.plot_date(microclimate_test.index, microclimate_test[col], ms=1.5, color='r', label='test')
    ax.set_ylabel(col)
    if i == 0:
        ax.legend(loc='upper right', markerscale=10, fontsize='xx-large')
Here's another fun insight: this problem has a time component, and in the real world we would be predicting the future. That is, we're trying to figure out the upcoming yield based on the current weather. For those of us concerned about overfitting (hint: all of us), that means we need to think hard about our modeling assumptions.
So, here are some things we could do all willy-nilly but probably shouldn't:
But this is a benchmark and we're not all about rules on this blog. Watch us break every single one of these cautionary warnings below! (That's why they call it a benchmark. (Actually that statement doesn't make sense, we don't know why it's called a benchmark. (HOLY MOLY, TOO MANY PARENTHESES #inception #common-lisp)))
In [12]:
print('train', microclimate_train.shape)
print('labels', labels.shape)
In [13]:
microclimate_train.columns.tolist()
Out[13]:
In [14]:
wanted_cols = [u'percip_mm', u'humidity', u'temp', u'leafwet450_min', u'leafwet_lwscnt', u'wind_ms']
wanted = microclimate_train[wanted_cols].copy().dropna()
wanted['yield'] = labels['yield']
sns.pairplot(wanted, diag_kind='kde')
plt.show()
We can see some relationships starting to emerge here, but maybe not as directly correlated with yield as we may have hoped.
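To put rough numbers on that impression, we can peek at the linear correlation of each column with yield (a quick sketch reusing the `wanted` frame from the cell above):

# How strongly does each microclimate column correlate with yield?
wanted.corr()['yield'].sort_values(ascending=False)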
Quick nomenclature check: we are training our model on the competition's training data, but we still need to use some cross validation techniques to avoid overfitting the heck out of the data.
We'll go ahead and split up the competition training set into its own training and test sets for our modeling purposes here.
In [15]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(microclimate_train,
                                                     labels.values.ravel(),
                                                     test_size=0.3)
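Note that `train_test_split` shuffles rows at random, so our model gets to peek at measurements scattered across the whole timeline. A time-respecting alternative might look something like the sketch below (a chronological holdout; the `_ts` variable names are just for illustration, and the benchmark sticks with the random split):

# Sketch only: train on the earliest 70% of timestamps, evaluate on the most recent 30%.
ordered = microclimate_train.sort_index()
cutoff = int(len(ordered) * 0.7)
X_train_ts, X_test_ts = ordered.iloc[:cutoff], ordered.iloc[cutoff:]
y_train_ts = labels.loc[X_train_ts.index].values.ravel()
y_test_ts = labels.loc[X_test_ts.index].values.ravel()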
Here's a scikit-learn pipeline which will handle a few tasks for us in a well-defined, repeatable way: impute missing values, reduce dimensionality with PCA, and fit a random forest regressor.
And we'll GRID SEARCH ALL THE THINGS, because, hey, why not, right???!1
In [16]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import Imputer
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
steps = [('imputer', Imputer()),
         ('pca', PCA()),
         ('rf', RandomForestRegressor())]

pipe = Pipeline(steps)

# create the grid search
params = {
    'pca__n_components': range(2, X_train.shape[1]),
    'imputer__strategy': ['mean', 'median', 'most_frequent'],
    'rf__n_estimators': [5, 10, 20]
}
estimator = GridSearchCV(pipe, param_grid=params, n_jobs=-1, verbose=1)
estimator.fit(X_train, y_train.ravel())
Out[16]:
In [17]:
estimator.best_params_
Out[17]:
The competition uses root mean squared error as the metric, so let's see how we do on that front.
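For reference, root mean squared error is just the square root of the average squared difference between predicted and actual yields:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$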
In [18]:
from sklearn.metrics import mean_squared_error
y_hat = estimator.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_hat))
rmse
Out[18]:
We can also plot our actuals against our predicted values and see if there's a trend.
In [38]:
fig, ax = plt.subplots(figsize=(8, 8))
plt.scatter(y_test, y_hat)
plt.xlabel('actual', fontsize=20)
plt.ylabel('predicted', fontsize=20)
plt.plot(np.linspace(0, 35), np.linspace(0, 35), label="$y=x$")
plt.xlim(0, 35)
plt.ylim(0, 35)
plt.legend(loc='upper left', fontsize=20)
plt.show()
Let's look at the residuals by themselves:
In [20]:
fig, ax = plt.subplots(figsize=(16, 4))
err = y_test - y_hat
ax.plot_date(X_test.index, err, c='r', ms=3)
ax.set_title('residuals on test data (each)', fontsize=20)
ax.set_ylabel('error')
plt.show()
And also look at the distribution of residuals.
In [21]:
fig, ax = plt.subplots(figsize=(16, 4))
plt.hist(err, bins=20, normed=True)
plt.title('residuals on test data (distribution)', fontsize=20)
plt.xlim(-20, 20)
plt.show()
So we have our predictions in hand, and we have some idea of the error (although in practice this number only makes sense compared to other submissions — yet another great reason to give people a benchmark, even with a very naive model!).
Time to make this into a submission.
So we know we need to predict the labels from the training inputs we were given, but let's just double check how it's supposed to look.
In [22]:
!head data/submission_format.csv
In [23]:
submission_format.head()
Out[23]:
In [24]:
submission_format.dtypes
Out[24]:
In [25]:
submission_format.shape
Out[25]:
In [26]:
microclimate_test.shape
Out[26]:
We'll do a quick and dirty mitigation by making sure the test data we pass into our pipeline has an input row for each and every expected row in the submission format, even if some of those rows are entirely NaN.
An easy way to get the right index for our purposes is a left outer join, where the "left-hand" table is our submission format and the "right-hand" table is the microclimate data we have. The left join keeps every row of the submission format, matches up the rows whose timestamps appear in both tables, and fills the microclimate columns with NaNs wherever there is no match.
Here's a helpful figure from the pandas docs:
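(If the figure doesn't render for you, here's a toy version of the same idea; the dates and values below are made up purely for illustration.)

# Toy illustration of a left join: every index in `left` is kept, and any index
# missing from `right` comes back with NaNs in the right-hand columns.
left = pd.DataFrame({'yield': [0.0, 0.0, 0.0]},
                    index=pd.to_datetime(['2014-01-01', '2014-01-02', '2014-01-03']))
right = pd.DataFrame({'temp': [11.2, 13.5]},
                     index=pd.to_datetime(['2014-01-01', '2014-01-03']))
left.join(right, how='left')  # 2014-01-02 has no match, so its temp is NaN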
In [27]:
test = submission_format.join(microclimate_test, how='left') # left join onto the format
test = test[microclimate_test.columns] # now just subset back down to the input columns
assert (test.index == submission_format.index).all()
We'll make predictions and fill in the yield column with actual outputs from our model:
In [28]:
submission_format['yield'] = estimator.predict(test)
Now we'll write it out to a file:
In [29]:
submission_format.to_csv("first_submission.csv")
And we can submit it to the competition and see what we end up with:
Ouch. The overfitting police have come calling. Note how we ended up scoring much more poorly on the withheld evaluation data than on our own held-out test set. That's a clear indication that we overfit the data.
But the gauntlet has been thrown down. Will you step up to beat this ... pettifogging model?