We tried different regression models (GBR, SVM, MLP, Random Forest and KNN), as recommended by the winning team of the Kaggle taxi-trajectory competition. So far GBR seems to be the best, slightly ahead of the Random Forest.
The new features we extracted only had a small impact on the predictions, but they improved them consistently.
We tried an LSTM to take advantage of the sequential structure of the data, but it did not work well, probably because there is not enough data (13M lines divided by the average sequence length of 15, minus the roughly 30% of fully empty sequences).
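As a rough back-of-the-envelope check of that estimate (using the approximate figures quoted above, not exact dataset counts):
In [ ]:
# Rough estimate of the number of usable sequences for a sequence model,
# based on the approximate figures quoted above (not exact dataset counts)
total_rows = 13e6        # ~13M lines in train.csv
avg_seq_len = 15         # average number of radar scans per gauge reading
fully_empty_frac = 0.30  # ~30% of the sequences contain no radar data at all
usable_sequences = total_rows / avg_seq_len * (1 - fully_empty_frac)
print("~%d usable sequences" % usable_sequences)  # roughly 600k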
In [2]:
# from __future__ import exam_success
from __future__ import absolute_import
from __future__ import print_function
# Standard imports
%matplotlib inline
import os
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import random
import pandas as pd
import scipy.stats as stats
# Sk cheats
from sklearn.cross_validation import cross_val_score
from sklearn import grid_search
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
#from sklearn.preprocessing import Imputer # get rid of nan
from sklearn.decomposition import NMF # to add features based on the latent representation
from sklearn.decomposition import ProjectedGradientNMF
# Faster gradient boosting
import xgboost as xgb
# For neural networks models
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, RMSprop
Predictions are made in the US corn-growing states (mainly Iowa, Illinois, Indiana) during the season with the highest rainfall (as illustrated by Iowa for the April to August months).
The Kaggle page indicates that the dataset has been shuffled, so working on a subset seems acceptable.
However, the test set is not extracted from the same data as the training set, which makes the evaluation trickier.
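For reference, a reduced file like the one loaded below can be built by simply keeping the first rows of the shuffled training set; a minimal sketch (the row count and file names are arbitrary, and the last Id may end up truncated):
In [ ]:
# Minimal sketch: build a reduced training file from the (shuffled) full set.
# The row count and output name are arbitrary; the last Id may be cut short.
subset = pd.read_csv("data/train.csv", nrows=100000)
subset.to_csv("data/reduced_train_100000.csv", index=False)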
In [3]:
%%time
#filename = "data/train.csv"
filename = "data/reduced_train_100000.csv"
#filename = "data/reduced_train_1000000.csv"
raw = pd.read_csv(filename)
raw = raw.set_index('Id')
In [4]:
raw.columns
Out[4]:
In [5]:
raw['Expected'].describe()
Out[5]:
Per Wikipedia, a rate of more than 421 mm/h is considered "Extreme/large hail".
If we encounter values around 327.40 meters per hour, we should probably start building Noah's ark.
It therefore seems reasonable to drop the largest values as outliers.
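Before fixing a cap, a high quantile of Expected gives a data-driven reference point; a small sketch (the 99.9th percentile is an arbitrary choice):
In [ ]:
# Data-driven reference for the cut-off: a high quantile of Expected
# (the 0.999 quantile is an arbitrary choice for this sketch)
print("99.9th percentile of Expected: %.1f" % raw['Expected'].quantile(0.999))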
In [6]:
# Considering that the gauge may concentrate the rainfall, the cap could be set as high as 1000;
# here we keep values below 300. Comment the filtering line to analyse the complete dataset.
l = len(raw)
raw = raw[raw['Expected'] < 300] #1000
print("Dropped %d (%0.2f%%)"%(l-len(raw),(l-len(raw))/float(l)*100))
Our data look like this:
In [7]:
raw.head(10)
Out[7]:
In [8]:
raw.describe()
Out[8]:
Numerous features, with a lot of variability.
We group the data by Id.
In [10]:
# features_columns holds the radar features, excluding minutes_past and radardist_km,
# because we ignore the temporal distribution of the sequence for now
features_columns = list([u'Ref', u'Ref_5x5_10th',
u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
# Build, for each Id, the matrix of its radar observations (X) and its target Expected (y)
def getXy(raw):
selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
data = raw[selected_columns]
docX, docY = [], []
for i in data.index.unique():
        # an Id with a single observation comes back as a Series rather than a DataFrame
        if isinstance(data.loc[i],pd.core.series.Series):
m = [data.loc[i].as_matrix()]
docX.append(m)
docY.append(float(raw.loc[i]["Expected"]))
else:
m = data.loc[i].as_matrix()
docX.append(m)
docY.append(float(raw.loc[i][:1]["Expected"]))
X , y = np.array(docX) , np.array(docY)
return X,y
In [11]:
#noAnyNan = raw.loc[raw[features_columns].dropna(how='any').index.unique()]
noAnyNan = raw.dropna()
In [12]:
noFullNan = raw.loc[raw[features_columns].dropna(how='all').index.unique()]
We can't do much with the fully empty rows, so we isolate them and decide what value to predict for them based on the analysis done in the data-visualization part.
In [13]:
fullNan = raw.drop(raw[features_columns].dropna(how='all').index)
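One simple way to pick that constant, sketched below, is the median Expected over the fully empty ids of this training subset (only an illustration; the value used later for the test set comes from the analysis of the full dataset):
In [ ]:
# Illustration: median Expected over the fully empty training ids
# (Expected is constant within an Id, so we keep one value per Id)
fullNanExpected = fullNan.groupby(level=0)['Expected'].first()
print("Median Expected for fully empty ids: %.3f" % fullNanExpected.median())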
In [14]:
print("Complete dataset:",len(raw))
print("Fully filled: ",len(noAnyNan))
print("Partly filled: ",len(noFullNan))
print("Fully empty: ",len(fullNan))
In [15]:
%%time
#X,y=getXy(noAnyNan)
X,y=getXy(noFullNan)
Our data still look the same:
In [16]:
X[0][:10,:8]
Out[16]:
Now let's add some features.
We start by averaging every sequence of measurements, and we use a non-negative matrix factorization (NMF) to produce new features that locate each sequence in a latent space. Hopefully this will help the models treat similar sequences the same way.
(The "RuntimeWarning: Mean of empty slice" warning is expected.)
In [18]:
# used to fill fully empty rows
global_means = np.nanmean(noFullNan,0)
XX=[]
for t in X:
nm = np.nanmean(t,0)
for idx,j in enumerate(nm):
if np.isnan(j):
nm[idx]=global_means[idx]
XX.append(nm)
XX=np.array(XX)
# shift each feature so its minimum is 0 (NMF requires non-negative input)
XX_rescaled=XX[:,:]-np.min(XX,0)
In [19]:
%%time
nmf = NMF(max_iter=5000)
W = nmf.fit_transform(XX_rescaled)
#H = nmf.components_
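A quick sanity check on the factorization (the reconstruction_err_ attribute may not exist in every scikit-learn version, hence the getattr):
In [ ]:
# Quick sanity check of the NMF output
print("Latent representation shape:", W.shape)
print("Reconstruction error:", getattr(nmf, 'reconstruction_err_', 'n/a'))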
Now we add other simpler features:
In [20]:
# Collapse the sequence structure of the data and produce
# new, hopefully informative features.
# mf: optional per-sequence latent features (e.g. the NMF factors) to append.
def addFeatures(X,mf=None):
    # used to fill fully empty rows
    #global_means = np.nanmean(X,0)
XX=[]
nbFeatures=float(len(X[0][0]))
for idxt,t in enumerate(X):
        # compute the column means, ignoring NaN; flag the columns that are entirely NaN
nm = np.nanmean(t,0)
tt=[]
for idx,j in enumerate(nm):
if np.isnan(j):
nm[idx]=global_means[idx]
tt.append(1)
else:
tt.append(0)
tmp = np.append(nm,np.append(tt,tt.count(0)/nbFeatures))
# faster if working on fully filled data:
#tmp = np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) )
# add the percentiles
tmp = np.append(tmp,np.nanpercentile(t,10,axis=0))
tmp = np.append(tmp,np.nanpercentile(t,50,axis=0))
tmp = np.append(tmp,np.nanpercentile(t,90,axis=0))
for idx,i in enumerate(tmp):
if np.isnan(i):
tmp[idx]=0
        # add the Marshall-Palmer rainfall estimate (computed from the dBZ reflectivity) as a feature
test = t
try:
taa=test[:,0]
except TypeError:
taa=[test[0][0]]
valid_time = np.zeros_like(taa)
valid_time[0] = taa[0]
for n in xrange(1,len(taa)):
valid_time[n] = taa[n] - taa[n-1]
valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
valid_time = valid_time / 60.0
        mp_rain = 0 # Marshall-Palmer rainfall accumulated over the hour
        try:
            column_ref=test[:,2]
        except TypeError:
            column_ref=[test[0][2]]
        for dbz, hours in zip(column_ref, valid_time):
            # See: https://en.wikipedia.org/wiki/DBZ_(meteorology)
            if np.isfinite(dbz):
                mmperhr = pow(pow(10, dbz/10)/200, 0.625)
                mp_rain = mp_rain + mmperhr * hours
        if mf is not None:
            # append the latent (NMF) representation of this sequence
            tmp = np.append(tmp,mf[idxt])
        XX.append(np.append(np.array(mp_rain),tmp))
        #XX.append(np.array([mp_rain]))
        #XX.append(tmp)
return np.array(XX)
In [21]:
%%time
XX=addFeatures(X,mf=W)
#XX=addFeatures(X)
We are now ready to train our models, so we prepare the training and evaluation datasets.
In [22]:
def splitTrainTest(X, y, split=0.2):
tmp1, tmp2 = [], []
ps = int(len(X) * (1-split))
index_shuf = range(len(X))
random.shuffle(index_shuf)
for i in index_shuf:
tmp1.append(X[i])
tmp2.append(y[i])
return tmp1[:ps], tmp2[:ps], tmp1[ps:], tmp2[ps:]
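For reference, scikit-learn's built-in train_test_split does essentially the same job (a sketch only, not used below; note that it returns the splits in a different order than splitTrainTest):
In [ ]:
# Equivalent split using scikit-learn (sketch only, not used below)
from sklearn.cross_validation import train_test_split
Xa_train, Xa_test, ya_train, ya_test = train_test_split(XX, y, test_size=0.2)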
In [23]:
X_train,y_train, X_test, y_test = splitTrainTest(XX,y)
In [24]:
# negative mean squared error, used as the scorer for the cross-validated grid searches
# (scored on the fold passed in by GridSearchCV, not on the global test set)
def manualScorer(estimator, X, y):
    err = (estimator.predict(X)-y)**2
    return -err.sum()/len(err)
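This custom scorer is simply the negative mean squared error; depending on the scikit-learn version, the same thing can be requested with a built-in scoring string (sketch only, estimator and parameters are placeholders, and the exact string name varies between versions):
In [ ]:
# Sketch: built-in negative-MSE scoring. The string is 'mean_squared_error'
# in older scikit-learn releases and 'neg_mean_squared_error' in later ones.
#grid = grid_search.GridSearchCV(estimator, parameters, scoring='mean_squared_error')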
In [25]:
svr = SVR(kernel='rbf', C=600.0)
We tried many parameter combinations; for demonstration purposes and speed, we only search over a few values of C here.
In [46]:
%%time
parameters = {'C':range(600,1001,200)}
grid_svr = grid_search.GridSearchCV(svr, parameters,scoring=manualScorer)
grid_svr.fit(X_train,y_train)
print(grid_svr.grid_scores_)
print("Best: ",grid_svr.best_params_)
We then fit the model using the chosen parameters.
In [26]:
%%time
svr = svr.fit(X_train,y_train)
In [27]:
print(svr.score(X_train,y_train))
print(svr.score(X_test,y_test))
In [66]:
#np.shape(svr.support_vectors_)
#svr.support_vectors_.mean(0)
In [50]:
%%time
svr_score = cross_val_score(svr, X_train, y_train, cv=5)
print("Score: %s\nMean: %.03f"%(svr_score,svr_score.mean()))
In [68]:
knn = KNeighborsRegressor(n_neighbors=6,weights='distance',algorithm='ball_tree')
In [29]:
#parameters = {'weights':('distance','uniform'),'algorithm':('auto', 'ball_tree', 'kd_tree', 'brute')}
parameters = {'n_neighbors':range(1,10,1)}
grid_knn = grid_search.GridSearchCV(knn, parameters,scoring=manualScorer)
In [30]:
%%time
grid_knn.fit(X_train,y_train)
Out[30]:
In [31]:
print(grid_knn.grid_scores_)
print("Best: ",grid_knn.best_params_)
In [32]:
knn = grid_knn.best_estimator_
In [78]:
knn= knn.fit(X_train,y_train)
In [83]:
print(knn.score(X_train,y_train))
print(knn.score(X_test,y_test))
In [84]:
etreg = ExtraTreesRegressor(n_estimators=200, max_depth=None, min_samples_split=1, random_state=0)
In [92]:
parameters = {'n_estimators':range(100,200,20)}
grid_rf = grid_search.GridSearchCV(etreg, parameters,n_jobs=2,scoring=manualScorer)
In [93]:
%%time
grid_rf.fit(X_train,y_train)
Out[93]:
In [94]:
print(grid_rf.grid_scores_)
print("Best: ",grid_rf.best_params_)
In [135]:
#etreg = grid_rf.best_estimator_
In [87]:
%%time
etreg = etreg.fit(X_train,y_train)
In [88]:
print(etreg.score(X_train,y_train))
print(etreg.score(X_test,y_test))
In [89]:
rfr = RandomForestRegressor(n_estimators=200, criterion='mse', max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=-1,
random_state=None, verbose=0, warm_start=False)
In [90]:
%%time
rfr = rfr.fit(X_train,y_train)
In [91]:
print(rfr.score(X_train,y_train))
print(rfr.score(X_test,y_test))
In [141]:
# the dBZ-based (Marshall-Palmer) feature does not influence xgbr very much
xgbr = xgb.XGBRegressor(max_depth=6, learning_rate=0.1, n_estimators=700, silent=True,
objective='reg:linear', nthread=-1, gamma=0, min_child_weight=1,
max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5,
seed=0, missing=None)
In [142]:
%%time
xgbr = xgbr.fit(X_train,y_train)
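Instead of fixing n_estimators at 700, XGBoost can stop boosting when the validation error stops improving; a sketch, assuming an xgboost version whose sklearn wrapper supports eval_set and early_stopping_rounds:
In [ ]:
# Sketch: let XGBoost choose the number of boosting rounds via early stopping
# (assumes the sklearn wrapper of this xgboost version supports these arguments)
xgbr_es = xgb.XGBRegressor(max_depth=6, learning_rate=0.1, n_estimators=2000)
xgbr_es.fit(np.array(X_train), np.array(y_train),
            eval_set=[(np.array(X_test), np.array(y_test))],
            eval_metric='rmse', early_stopping_rounds=50)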
In [119]:
# without the nmf features
# print(xgbr.score(X_train,y_train))
## 0.993948231144
# print(xgbr.score(X_test,y_test))
## 0.613931733332
In [143]:
# with nmf features
print(xgbr.score(X_train,y_train))
print(xgbr.score(X_test,y_test))
In [144]:
gbr = GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=900,
subsample=1.0, min_samples_split=2, min_samples_leaf=1, max_depth=4, init=None,
random_state=None, max_features=None, alpha=0.5,
verbose=0, max_leaf_nodes=None, warm_start=False)
In [145]:
%%time
gbr = gbr.fit(X_train,y_train)
#os.system('say "終わりだ"') # "it's over!" in Japanese
In [82]:
#parameters = {'max_depth':range(2,5,1),'alpha':[0.5,0.6,0.7,0.8,0.9]}
#parameters = {'subsample':[0.2,0.4,0.5,0.6,0.8,1]}
#parameters = {'subsample':[0.2,0.5,0.6,0.8,1],'n_estimators':[800,1000,1200]}
#parameters = {'max_depth':range(2,4,1)}
parameters = {'n_estimators':[400,800,1100]}
#parameters = {'loss':['ls', 'lad', 'huber', 'quantile'],'alpha':[0.3,0.5,0.8,0.9]}
#parameters = {'learning_rate':[0.1,0.5,0.9]}
grid_gbr = grid_search.GridSearchCV(gbr, parameters,n_jobs=2,scoring=manualScorer)
In [83]:
%%time
grid_gbr = grid_gbr.fit(X_train,y_train)
In [84]:
print(grid_gbr.grid_scores_)
print("Best: ",grid_gbr.best_params_)
In [146]:
print(gbr.score(X_train,y_train))
print(gbr.score(X_test,y_test))
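Since GBR is our best single model, it is worth a quick look at which engineered features it relies on (feature 0 is the Marshall-Palmer estimate prepended in addFeatures):
In [ ]:
# Rough look at the most important features for the GBR
# (index 0 is the Marshall-Palmer estimate prepended in addFeatures)
imp = gbr.feature_importances_
for i in np.argsort(imp)[::-1][:10]:
    print("feature %3d  importance %.3f" % (i, imp[i]))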
In [244]:
modelList = [svr,knn,etreg,rfr,xgbr,gbr]
reducedModelList = [knn,etreg,xgbr,gbr]
In [214]:
score_train = [[str(f).split("(")[0],f.score(X_train,y_train)] for f in modelList]
score_test = [[str(f).split("(")[0],f.score(X_test,y_test)] for f in modelList]
for idx,i in enumerate(score_train):
print(i[0])
print(" train: %.03f"%i[1])
print(" test: %.03f"%score_test[idx][1])
In [245]:
#reducedModelList = [knn,etreg,xgbr,gbr]
globalPred = np.array([f.predict(XX) for f in reducedModelList]).T
#globalPred.mean(1)
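We simply average the predictions of these four models; a natural refinement would be a weighted average, with weights tuned on the held-out split (a sketch, the weights below are placeholders):
In [ ]:
# Sketch of a weighted ensemble; the weights are placeholders and would have
# to be tuned on the held-out split (order: knn, etreg, xgbr, gbr)
weights = np.array([0.15, 0.25, 0.30, 0.30])
weightedPred = globalPred.dot(weights) / weights.sum()
err = (weightedPred - y)**2
print(err.sum()/len(err))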
In [226]:
globalPred[0]
Out[226]:
In [227]:
y[0]
Out[227]:
In [246]:
err = (globalPred.mean(1)-y)**2
print(err.sum()/len(err))
In [232]:
err = (globalPred.mean(1)-y)**2
print(err.sum()/len(err))
In [242]:
for f in modelList:
print(str(f).split("(")[0])
err = (f.predict(XX)-y)**2
print(err.sum()/len(err))
In [266]:
err = (XX[:,0]-y)**2
print(err.sum()/len(err))
In [243]:
for f in modelList:
print(str(f).split("(")[0])
print(f.score(XX,y))
In [ ]:
XX[:10,0] # feature 0 is marshall-palmer
In [238]:
svrMeta = SVR()
In [239]:
%%time
svrMeta = svrMeta.fit(globalPred,y)
In [241]:
err = (svrMeta.predict(globalPred)-y)**2
print(err.sum()/len(err))
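Note that this meta-model is trained and evaluated on predictions made on data the base models have already seen, so the error above is optimistic. A more honest stacking would use out-of-fold predictions, for example (a sketch, not run here):
In [ ]:
# Sketch of stacking on out-of-fold predictions to avoid the optimistic bias
# of reusing in-sample base-model predictions (not run here)
#from sklearn.cross_validation import cross_val_predict
#oofPred = np.array([cross_val_predict(f, np.array(X_train), np.array(y_train), cv=5)
#                    for f in reducedModelList]).T
#svrMetaOof = SVR().fit(oofPred, y_train)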
The following neural-network experiment (a simple MLP on the aggregated features) is kept here for reference.
In [41]:
in_dim = len(XX[0])
out_dim = 1
model = Sequential()
# Dense(128) is a fully-connected layer with 128 hidden units.
# In the first layer we must specify the expected input shape:
# here, vectors with one entry per engineered feature (in_dim).
model.add(Dense(128, input_shape=(in_dim,)))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(1, init='uniform'))
model.add(Activation('linear'))
#sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
#model.compile(loss='mean_squared_error', optimizer=sgd)
rms = RMSprop()
model.compile(loss='mean_squared_error', optimizer=rms)
#model.fit(X_train, y_train, nb_epoch=20, batch_size=16)
#score = model.evaluate(X_test, y_test, batch_size=16)
In [42]:
# clip the training targets at 20 to limit the influence of extreme values
prep = []
for i in y_train:
    prep.append(min(i,20))
In [43]:
prep=np.array(prep)
mi,ma = prep.min(),prep.max()
fy = (prep-mi) / (ma-mi)
#my = fy.max()
#fy = fy/fy.max()
In [44]:
model.fit(np.array(X_train), fy, batch_size=10, nb_epoch=10, validation_split=0.1)
Out[44]:
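To limit overfitting of this small network, the fit above could also use an early-stopping callback (a sketch, assuming keras.callbacks.EarlyStopping is available in this Keras version):
In [ ]:
# Sketch: stop training when the validation loss stops improving
# (assumes keras.callbacks.EarlyStopping exists in this Keras version)
#from keras.callbacks import EarlyStopping
#early_stop = EarlyStopping(monitor='val_loss', patience=3)
#model.fit(np.array(X_train), fy, batch_size=10, nb_epoch=50,
#          validation_split=0.1, callbacks=[early_stop])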
In [45]:
pred = model.predict(np.array(X_test))*(ma-mi)+mi # invert the min-max scaling of the targets
In [46]:
err = (pred-y_test)**2
err.sum()/len(err)
Out[46]:
In [ ]:
r = random.randrange(len(X_train))
print("(Train) Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_train[r]]))[0][0]*ma+mi,y_train[r]))
r = random.randrange(len(X_test))
print("(Test) Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_test[r]]))[0][0]*ma+mi,y_test[r]))
In [338]:
%%time
filename = "data/reduced_test_5000.csv"
#filename = "data/test.csv"
test = pd.read_csv(filename)
test = test.set_index('Id')
In [339]:
features_columns = list([u'Ref', u'Ref_5x5_10th',
u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
# Same as getXy above, but without the target (the test set has no Expected column)
def getX(raw):
selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
data = raw[selected_columns]
docX= []
for i in data.index.unique():
if isinstance(data.loc[i],pd.core.series.Series):
m = [data.loc[i].as_matrix()]
docX.append(m)
else:
m = data.loc[i].as_matrix()
docX.append(m)
X = np.array(docX)
return X
In [340]:
#%%time
#X=getX(test)
#tmp = []
#for i in X:
# tmp.append(len(i))
#tmp = np.array(tmp)
#sns.countplot(tmp,order=range(tmp.min(),tmp.max()+1))
#plt.title("Number of ID per number of observations\n(On test dataset)")
#plt.plot()
In [341]:
#testFull = test.dropna()
testNoFullNan = test.loc[test[features_columns].dropna(how='all').index.unique()]
In [342]:
%%time
X=getX(testNoFullNan) # 1min
#XX = [np.array(t).mean(0) for t in X] # 10s
In [ ]:
In [343]:
XX=[]
for t in X:
nm = np.nanmean(t,0)
for idx,j in enumerate(nm):
if np.isnan(j):
nm[idx]=global_means[idx]
XX.append(nm)
XX=np.array(XX)
# shift each feature so its minimum is 0 (NMF requires non-negative input)
XX_rescaled=XX[:,:]-np.min(XX,0)
In [344]:
%%time
W = nmf.transform(XX_rescaled)
In [345]:
XX=addFeatures(X,mf=W)
In [346]:
pd.DataFrame(xgbr.predict(XX)).describe()
Out[346]:
In [348]:
reducedModelList = [knn,etreg,xgbr,gbr]
globalPred = np.array([f.predict(XX) for f in reducedModelList]).T
predTest = globalPred.mean(1)
In [100]:
predFull = zip(testNoFullNan.index.unique(),predTest)
In [101]:
testNan = test.drop(test[features_columns].dropna(how='all').index)
In [ ]:
#pred = predFull + predNan # superseded by the full combination below, after predNan and predLeft are defined
In [102]:
tmp = np.empty(len(testNan))
tmp.fill(0.445000) # 50th percentile of Expected for the fully empty (all-NaN) ids in the training data
predNan = zip(testNan.index.unique(),tmp)
In [103]:
# ids covered by neither predFull nor predNan (should be empty with the current split)
testLeft = test.drop(testNan.index.unique()).drop(testNoFullNan.index.unique())
In [104]:
tmp = np.empty(len(testLeft))
tmp.fill(1.27) # fallback value (a median from the training data) for any remaining ids
predLeft = zip(testLeft.index.unique(),tmp)
In [105]:
len(testNoFullNan.index.unique())
Out[105]:
In [106]:
len(testNan.index.unique())
Out[106]:
In [107]:
len(testLeft.index.unique())
Out[107]:
In [108]:
pred = predFull + predNan + predLeft
In [113]:
pred.sort(key=lambda x: x[0], reverse=False)
In [114]:
submission = pd.DataFrame(pred)
submission.columns = ["Id","Expected"]
submission.head()
Out[114]:
In [115]:
submission.loc[submission['Expected']<0,'Expected'] = 0.445
In [116]:
submission.to_csv("submit4.csv",index=False)
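A couple of quick sanity checks before uploading:
In [ ]:
# Sanity checks on the submission: every Id exactly once, no NaN, no negative rainfall
print(len(submission), submission['Id'].nunique())
print(submission['Expected'].isnull().sum(), (submission['Expected'] < 0).sum())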
In [ ]:
In [73]:
filename = "data/sample_solution.csv"
sol = pd.read_csv(filename)
In [74]:
sol
Out[74]:
In [ ]:
ss = np.array(sol)
In [ ]:
%%time
for a,b in predFull:
ss[a-1][1]=b
In [ ]:
ss
In [75]:
sub = pd.DataFrame(pred)
sub.columns = ["Id","Expected"]
sub.Id = sub.Id.astype(int)
sub.head()
Out[75]:
In [76]:
sub.to_csv("submit3.csv",index=False)