FDMS TME3

Kaggle How Much Did It Rain? II

Florian Toque & Paul Willot

Notes

We tried different regressor model, like GBR, SVM, MLP, Random Forest and KNN as recommanded by the winning team of the Kaggle on taxi trajectories. So far GBR seems to be the best, slightly better than the RF.
The new features we exctracted only made a small impact on predictions but still improved them consistently.
We tried to use a LSTM to take advantage of the sequential structure of the data but it didn't work too well, probably because there is not enought data (13M lines divided per the average length of sequences (15), less the 30% of fully empty data)


In [321]:
# from __future__ import exam_success
from __future__ import absolute_import 
from __future__ import print_function  

# Standard imports
%matplotlib inline
import os
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import random
import pandas as pd
import scipy.stats as stats

# Sk cheats
from sklearn.cross_validation import cross_val_score
from sklearn import grid_search
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
#from sklearn.preprocessing import Imputer   # get rid of nan
from sklearn.decomposition import NMF        # to add features based on the latent representation
from sklearn.decomposition import ProjectedGradientNMF

# Faster gradient boosting
import xgboost as xgb

# For neural networks models
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation 
from keras.optimizers import SGD, RMSprop
  • 13.765.202 lines in train.csv
  • 8.022.757 lines in test.csv

Few words about the dataset

Predictions is made in the USA corn growing states (mainly Iowa, Illinois, Indiana) during the season with the highest rainfall (as illustrated by Iowa for the april to august months)

The Kaggle page indicate that the dataset have been shuffled, so working on a subset seems acceptable
The test set is not a extracted from the same data as the training set however, which make the evaluation trickier

Load the dataset


In [322]:
%%time
#filename = "data/train.csv"
filename = "data/reduced_train_100000.csv"
#filename = "data/reduced_train_1000000.csv"
raw = pd.read_csv(filename)
raw = raw.set_index('Id')


CPU times: user 245 ms, sys: 47 ms, total: 292 ms
Wall time: 296 ms

In [323]:
raw.columns


Out[323]:
Index([u'minutes_past', u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th', u'Expected'],
      dtype='object')

In [324]:
raw['Expected'].describe()


Out[324]:
count    100000.000000
mean        129.579825
std         687.622542
min           0.010000
25%           0.254000
50%           1.016000
75%           3.556002
max       32740.617000
Name: Expected, dtype: float64

Per wikipedia, a value of more than 421 mm/h is considered "Extreme/large hail"
If we encounter the value 327.40 meter per hour, we should probably start building Noah's ark
Therefor, it seems reasonable to drop values too large, considered as outliers


In [325]:
# Considering that the gauge may concentrate the rainfall, we set the cap to 1000
# Comment this line to analyse the complete dataset 
l = len(raw)
raw = raw[raw['Expected'] < 300]  #1000
print("Dropped %d (%0.2f%%)"%(l-len(raw),(l-len(raw))/float(l)*100))


Dropped 6241 (6.24%)

In [326]:
raw.head(5)


Out[326]:
minutes_past radardist_km Ref Ref_5x5_10th Ref_5x5_50th Ref_5x5_90th RefComposite RefComposite_5x5_10th RefComposite_5x5_50th RefComposite_5x5_90th ... RhoHV_5x5_90th Zdr Zdr_5x5_10th Zdr_5x5_50th Zdr_5x5_90th Kdp Kdp_5x5_10th Kdp_5x5_50th Kdp_5x5_90th Expected
Id
1 3 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254
1 16 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254
1 25 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254
1 35 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254
1 45 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254

5 rows × 23 columns


In [327]:
raw.describe()


Out[327]:
minutes_past radardist_km Ref Ref_5x5_10th Ref_5x5_50th Ref_5x5_90th RefComposite RefComposite_5x5_10th RefComposite_5x5_50th RefComposite_5x5_90th ... RhoHV_5x5_90th Zdr Zdr_5x5_10th Zdr_5x5_50th Zdr_5x5_90th Kdp Kdp_5x5_10th Kdp_5x5_50th Kdp_5x5_90th Expected
count 93759.000000 93759.000000 44923.000000 38391.000000 45078.000000 53038.000000 47946.000000 42219.000000 48042.000000 55118.000000 ... 42064.000000 35938.000000 30835.000000 35925.000000 42064.000000 31231.000000 26418.000000 31283.000000 36505.000000 93759.000000
mean 29.684830 11.022334 23.684482 20.788948 23.378688 26.427731 25.424874 22.956797 25.139201 27.982365 ... 1.014742 0.597837 -0.564851 0.429577 2.018197 -0.013098 -3.383198 -0.429909 3.855601 5.837679
std 17.418876 4.259865 10.224306 9.073503 9.936862 11.186952 10.627954 9.638466 10.372406 11.535892 ... 0.045336 1.388384 0.974288 0.864887 1.539513 3.747791 2.771442 2.194894 3.762005 22.764656
min 0.000000 0.000000 -29.000000 -31.500000 -31.500000 -26.500000 -26.500000 -27.500000 -25.000000 -23.000000 ... 0.208333 -7.875000 -7.875000 -7.875000 -7.875000 -52.880005 -51.420000 -46.870010 -41.540010 0.010000
25% 15.000000 9.000000 17.500000 15.500000 17.500000 19.500000 19.000000 17.500000 18.500000 20.500000 ... 0.998333 -0.062500 -1.000000 0.062500 1.125000 -1.410004 -4.230011 -0.710007 1.759994 0.254000
50% 30.000000 12.000000 24.000000 21.000000 23.500000 27.000000 25.500000 23.000000 25.500000 28.500000 ... 1.005000 0.500000 -0.500000 0.375000 1.687500 0.000000 -2.809998 0.000000 3.169998 0.830000
75% 45.000000 14.000000 30.500000 27.000000 30.500000 34.500000 33.000000 29.500000 32.500000 36.500000 ... 1.051667 1.125000 0.000000 0.750000 2.500000 1.409988 -1.740006 0.349991 5.289993 2.794001
max 59.000000 21.000000 64.500000 57.000000 61.500000 67.500000 68.000000 59.500000 64.000000 79.500000 ... 1.051667 7.937500 5.937500 7.937500 7.937500 47.849990 1.759994 5.629990 43.209990 244.000120

8 rows × 23 columns

We regroup the data by ID


In [328]:
# We select all features except for the minutes past,
# because we ignore the time repartition of the sequence for now

features_columns = list([u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])

def getXy(raw):
    selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
    
    data = raw[selected_columns]
    
    docX, docY = [], []
    for i in data.index.unique():
        if isinstance(data.loc[i],pd.core.series.Series):
            m = [data.loc[i].as_matrix()]
            docX.append(m)
            docY.append(float(raw.loc[i]["Expected"]))
        else:
            m = data.loc[i].as_matrix()
            docX.append(m)
            docY.append(float(raw.loc[i][:1]["Expected"]))
    X , y = np.array(docX) , np.array(docY)
    return X,y

We prepare 3 subsets

Fully filled data only

Used at first to evaluate the models and no worry about features completion


In [329]:
#noAnyNan = raw.loc[raw[features_columns].dropna(how='any').index.unique()]
noAnyNan = raw.dropna()

In [330]:
noFullNan = raw.loc[raw[features_columns].dropna(how='all').index.unique()]

In [331]:
fullNan = raw.drop(raw[features_columns].dropna(how='all').index)

In [332]:
print(len(raw))
print(len(noAnyNan))
print(len(noFullNan))
print(len(fullNan))


93759
22158
70149
23610

Predicitons

As a first try, we make predictions on the complete data, and return the 50th percentile and uncomplete and fully empty data


In [333]:
%%time
#X,y=getXy(noAnyNan)
X,y=getXy(noFullNan)


CPU times: user 4.49 s, sys: 56.4 ms, total: 4.55 s
Wall time: 4.63 s

In [334]:
len(X[0][0])


Out[334]:
22

In [335]:
#%%time
#XX = [np.array(t).mean(0) for t in X]
#XX = np.array([np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) ) for t in X])

In [336]:
global_means = np.nanmean(noFullNan,0)

XX=[]
for t in X:
    nm = np.nanmean(t,0)
    for idx,j in enumerate(nm):
        if np.isnan(j):
            nm[idx]=global_means[idx]
    XX.append(nm)
XX=np.array(XX)

# rescale to clip min at 0 (for non negative matrix factorization)
XX_rescaled=XX[:,:]-np.min(XX,0)

In [337]:
%%time
nmf = ProjectedGradientNMF()
W = nmf.fit_transform(XX_rescaled)
#H = nn.components_


CPU times: user 4.07 s, sys: 201 ms, total: 4.27 s
Wall time: 3.77 s

In [ ]:

%%time nmf = NMF(max_iter=1000) W = nmf.fit_transform(XX) H = nmf.components_

In [290]:
# used to fill fully empty datas
global_means = np.nanmean(noFullNan,0)

# reduce the sequence structure of the data and produce
# new hopefully informatives features
def addFeatures(X,mf=0):
    # used to fill fully empty datas
    #global_means = np.nanmean(X,0)
    
    XX=[]
    nbFeatures=float(len(X[0][0]))
    for idxt,t in enumerate(X):
        
        # compute means, ignoring nan when possible, marking it when fully filled with nan
        nm = np.nanmean(t,0)
        tt=[]
        for idx,j in enumerate(nm):
            if np.isnan(j):
                nm[idx]=global_means[idx]
                tt.append(1)
            else:
                tt.append(0)
        tmp = np.append(nm,np.append(tt,tt.count(0)/nbFeatures))
        
        # faster if working on fully filled data:
        #tmp = np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) )
        
        # add the percentiles
        tmp = np.append(tmp,np.nanpercentile(t,10,axis=0))
        tmp = np.append(tmp,np.nanpercentile(t,50,axis=0))
        tmp = np.append(tmp,np.nanpercentile(t,90,axis=0))
        
        for idx,i in enumerate(tmp):
            if np.isnan(i):
                tmp[idx]=0

        # adding the dbz as a feature
        test = t
        try:
            taa=test[:,0]
        except TypeError:
            taa=[test[0][0]]
        valid_time = np.zeros_like(taa)
        valid_time[0] = taa[0]
        for n in xrange(1,len(taa)):
            valid_time[n] = taa[n] - taa[n-1]
        valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
        valid_time = valid_time / 60.0


        sum=0
        try:
            column_ref=test[:,2]
        except TypeError:
            column_ref=[test[0][2]]
        for dbz, hours in zip(column_ref, valid_time):
            # See: https://en.wikipedia.org/wiki/DBZ_(meteorology)
            if np.isfinite(dbz):
                mmperhr = pow(pow(10, dbz/10)/200, 0.625)
                sum = sum + mmperhr * hours
        
        if not(mf is 0):
            tmp = np.append(tmp,mf[idxt])

        XX.append(np.append(np.array(sum),tmp))
        #XX.append(np.array([sum]))
        #XX.append(tmp)
    return np.array(XX)

In [263]:
%%time
XX=addFeatures(X,mf=W)
#XX=addFeatures(X)


CPU times: user 20.8 s, sys: 156 ms, total: 21 s
Wall time: 21.2 s

In [265]:
XX[:10,0]


Out[265]:
array([ 0.63081403,  0.        ,  2.71759937,  0.04063975,  0.70164463,
        5.10341586,  0.59982168,  4.39959803,  0.24530427,  0.39020773])
(X[0][1:] - X[0][:-1]).sum(0) (t[1:] - t[:-1]).sum(0) %%time etreg = etreg.fit(XX,y)

In [121]:
def splitTrainTest(X, y, split=0.2):
    tmp1, tmp2 = [], []
    ps = int(len(X) * (1-split))
    index_shuf = range(len(X))
    random.shuffle(index_shuf)
    for i in index_shuf:
        tmp1.append(X[i])
        tmp2.append(y[i])
    return tmp1[:ps], tmp2[:ps], tmp1[ps:], tmp2[ps:]

In [122]:
X_train,y_train, X_test, y_test = splitTrainTest(XX,y)


In [123]:
def manualScorer(estimator, X, y):
    err = (estimator.predict(X_test)-y_test)**2
    return -err.sum()/len(err)

max prof 24 nb trees 84 min sample per leaf 17 min sample to split 51


In [185]:
svr = SVR(kernel='rbf', C=800.0)
parameters = {'C':range(500,900,50)} grid_svr = grid_search.GridSearchCV(svr, parameters,scoring=manualScorer) grid_svr.fit(X_train,y_train) print(grid_svr.grid_scores_) print("Best: ",grid_svr.best_params_)

In [186]:
%%time
srv = svr.fit(X_train,y_train)


CPU times: user 7.9 s, sys: 85.4 ms, total: 7.99 s
Wall time: 8.08 s

In [184]:
print(svr.score(X_train,y_train))
print(svr.score(X_test,y_test))


0.984176859055
0.40960167131

In [131]:
err = (svr.predict(X_train)-y_train)**2
err.sum()/len(err)


Out[131]:
330.21588147846546

In [132]:
err = (svr.predict(X_test)-y_test)**2
err.sum()/len(err)


Out[132]:
284.10441792226254

In [134]:
%%time
svr_score = cross_val_score(svr, XX, y, cv=5)
print("Score: %s\nMean: %.03f"%(svr_score,svr_score.mean()))


Score: [-0.02459267 -0.01029681 -0.00540083 -0.00335848 -0.01941362]
Mean: -0.013
CPU times: user 25.3 s, sys: 173 ms, total: 25.5 s
Wall time: 25.5 s


In [135]:
knn = KNeighborsRegressor(n_neighbors=6,weights='distance',algorithm='ball_tree')

In [29]:
#parameters = {'weights':('distance','uniform'),'algorithm':('auto', 'ball_tree', 'kd_tree', 'brute')}
parameters = {'n_neighbors':range(1,10,1)}
grid_knn = grid_search.GridSearchCV(knn, parameters,scoring=manualScorer)

In [30]:
%%time
grid_knn.fit(X_train,y_train)


CPU times: user 2.51 s, sys: 0 ns, total: 2.51 s
Wall time: 2.51 s
Out[30]:
GridSearchCV(cv=None,
       estimator=KNeighborsRegressor(algorithm='ball_tree', leaf_size=30, metric='minkowski',
          metric_params=None, n_neighbors=6, p=2, weights='distance'),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring=<function manualScorer at 0x7fcc54db8050>, verbose=0)

In [31]:
print(grid_knn.grid_scores_)
print("Best: ",grid_knn.best_params_)


[mean: nan, std: nan, params: {'n_neighbors': 1}, mean: nan, std: nan, params: {'n_neighbors': 2}, mean: nan, std: nan, params: {'n_neighbors': 3}, mean: nan, std: nan, params: {'n_neighbors': 4}, mean: nan, std: nan, params: {'n_neighbors': 5}, mean: nan, std: nan, params: {'n_neighbors': 6}, mean: nan, std: nan, params: {'n_neighbors': 7}, mean: nan, std: nan, params: {'n_neighbors': 8}, mean: nan, std: nan, params: {'n_neighbors': 9}]
Best:  {'n_neighbors': 1}

In [32]:
knn = grid_knn.best_estimator_

In [136]:
knn= knn.fit(X_train,y_train)

In [137]:
print(knn.score(X_train,y_train))
print(knn.score(X_test,y_test))


1.0
0.66232912873

In [72]:
err = (knn.predict(X_train)-y_train)**2
err.sum()/len(err)


Out[72]:
1.2703084913435574

In [73]:
err = (knn.predict(X_test)-y_test)**2
err.sum()/len(err)


Out[73]:
85.489427551154463


In [138]:
etreg = ExtraTreesRegressor(n_estimators=200, max_depth=None, min_samples_split=1, random_state=0)

In [55]:
parameters = {'n_estimators':range(100,200,20)}
grid_rf = grid_search.GridSearchCV(etreg, parameters,n_jobs=2,scoring=manualScorer)

In [37]:
%%time
grid_rf.fit(X_train,y_train)


CPU times: user 4.44 s, sys: 241 ms, total: 4.68 s
Wall time: 16.3 s
Out[37]:
GridSearchCV(cv=None, error_score='raise',
       estimator=ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
          min_samples_split=1, min_weight_fraction_leaf=0.0,
          n_estimators=200, n_jobs=1, oob_score=False, random_state=0,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=2,
       param_grid={'n_estimators': [100, 120, 140, 160, 180]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring=<function manualScorer at 0x110aa0ed8>, verbose=0)

In [38]:
print(grid_rf.grid_scores_)
print("Best: ",grid_rf.best_params_)


[mean: -55.73522, std: 40.35044, params: {'n_estimators': 100}, mean: -55.47051, std: 39.85010, params: {'n_estimators': 120}, mean: -56.18434, std: 40.62698, params: {'n_estimators': 140}, mean: -56.15046, std: 40.74838, params: {'n_estimators': 160}, mean: -56.37052, std: 40.72395, params: {'n_estimators': 180}]
Best:  {'n_estimators': 120}

In [39]:
grid_rf.best_params_


Out[39]:
{'n_estimators': 120}

In [135]:
#etreg = grid_rf.best_estimator_

In [139]:
%%time
etreg = etreg.fit(X_train,y_train)


CPU times: user 26.3 s, sys: 123 ms, total: 26.5 s
Wall time: 26.5 s

In [140]:
print(etreg.score(X_train,y_train))
print(etreg.score(X_test,y_test))


1.0
0.796451392571

In [77]:
err = (etreg.predict(X_train)-y_train)**2
err.sum()/len(err)


Out[77]:
1.2703084913435574

In [78]:
err = (etreg.predict(X_test)-y_test)**2
err.sum()/len(err)


Out[78]:
87.171223669446121


In [158]:
rfr = RandomForestRegressor(n_estimators=200, criterion='mse', max_depth=None, min_samples_split=2,
                            min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
                            max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=-1,
                            random_state=None, verbose=0, warm_start=False)

In [159]:
%%time
rfr = rfr.fit(X_train,y_train)


CPU times: user 2min 22s, sys: 420 ms, total: 2min 22s
Wall time: 39.5 s

In [160]:
print(rfr.score(X_train,y_train))
print(rfr.score(X_test,y_test))


0.945126763577
0.62248085166


In [141]:
# the dbz feature does not influence xgbr so much
xgbr = xgb.XGBRegressor(max_depth=6, learning_rate=0.1, n_estimators=700, silent=True,
                        objective='reg:linear', nthread=-1, gamma=0, min_child_weight=1,
                        max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
                        reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5,
                        seed=0, missing=None)

In [142]:
%%time
xgbr = xgbr.fit(X_train,y_train)


CPU times: user 1min 3s, sys: 1.75 s, total: 1min 5s
Wall time: 20.4 s

In [119]:
# without the nmf features
# print(xgbr.score(X_train,y_train))
## 0.993948231144
# print(xgbr.score(X_test,y_test))
## 0.613931733332


0.993948231144
0.613931733332

In [143]:
# with nmf features
print(xgbr.score(X_train,y_train))
print(xgbr.score(X_test,y_test))


0.999679564454
0.691887473333


In [144]:
gbr = GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=900,
                                subsample=1.0, min_samples_split=2, min_samples_leaf=1, max_depth=4, init=None,
                                random_state=None, max_features=None, alpha=0.5,
                                verbose=0, max_leaf_nodes=None, warm_start=False)

In [145]:
%%time
gbr = gbr.fit(X_train,y_train)
#os.system('say "終わりだ"') #its over!


CPU times: user 52 s, sys: 245 ms, total: 52.3 s
Wall time: 52.5 s

In [82]:
#parameters = {'max_depth':range(2,5,1),'alpha':[0.5,0.6,0.7,0.8,0.9]}
#parameters = {'subsample':[0.2,0.4,0.5,0.6,0.8,1]}
#parameters = {'subsample':[0.2,0.5,0.6,0.8,1],'n_estimators':[800,1000,1200]}
#parameters = {'max_depth':range(2,4,1)}
parameters = {'n_estimators':[400,800,1100]}
#parameters = {'loss':['ls', 'lad', 'huber', 'quantile'],'alpha':[0.3,0.5,0.8,0.9]}
#parameters = {'learning_rate':[0.1,0.5,0.9]}



grid_gbr = grid_search.GridSearchCV(gbr, parameters,n_jobs=2,scoring=manualScorer)

In [83]:
%%time
grid_gbr = grid_gbr.fit(X_train,y_train)


CPU times: user 38.1 s, sys: 566 ms, total: 38.7 s
Wall time: 2min 40s

In [84]:
print(grid_gbr.grid_scores_)
print("Best: ",grid_gbr.best_params_)


[mean: -132.42728, std: 13.82615, params: {'n_estimators': 400}, mean: -129.63912, std: 18.26752, params: {'n_estimators': 800}, mean: -131.56520, std: 14.32391, params: {'n_estimators': 1100}]
Best:  {'n_estimators': 800}

In [146]:
print(gbr.score(X_train,y_train))
print(gbr.score(X_test,y_test))


0.997514944924
0.683160735998

In [81]:
err = (gbr.predict(X_train)-y_train)**2
print(err.sum()/len(err))
err = (gbr.predict(X_test)-y_test)**2
print(err.sum()/len(err))


2.7346018758
92.3521455159

In [46]:
err = (gbr.predict(X_train)-y_train)**2
print(err.sum()/len(err))
err = (gbr.predict(X_test)-y_test)**2
print(err.sum()/len(err))


17.0046462288
1578.69166275


In [57]:
t = []
for i in XX:
    t.append(np.count_nonzero(~np.isnan(i)) / float(i.size))
pd.DataFrame(np.array(t)).describe()


Out[57]:
0
count 3093
mean 1
std 0
min 1
25% 1
50% 1
75% 1
max 1


In [215]:
svr.predict(X_test)


Out[215]:
array([  0.35491106,   2.20000364,   4.16269146, ...,   0.3539263 ,
         3.95877051,  14.5789092 ])

In [202]:
s = modelList[0]

In [229]:
t.mean(1)


Out[229]:
array([ 1.,  2.,  3.])

In [244]:
modelList = [svr,knn,etreg,rfr,xgbr,gbr]
reducedModelList = [knn,etreg,xgbr,gbr]

In [214]:
score_train = [[str(f).split("(")[0],f.score(X_train,y_train)] for f in modelList]
score_test = [[str(f).split("(")[0],f.score(X_test,y_test)] for f in modelList]
for idx,i in enumerate(score_train):
    print(i[0])
    print("   train: %.03f"%i[1])
    print("   test:  %.03f"%score_test[idx][1])


SVR
   train: 0.943
   test:  0.638
KNeighborsRegressor
   train: 1.000
   test:  0.662
ExtraTreesRegressor
   train: 1.000
   test:  0.796
RandomForestRegressor
   train: 0.945
   test:  0.622
XGBRegressor
   train: 1.000
   test:  0.692
GradientBoostingRegressor
   train: 0.998
   test:  0.683

In [245]:
#reducedModelList = [knn,etreg,xgbr,gbr]
globalPred = np.array([f.predict(XX) for f in reducedModelList]).T
#globalPred.mean(1)

In [226]:
globalPred[0]


Out[226]:
array([ 0.91597311,  1.0160005 ,  1.0160005 ,  1.26184066,  1.03321815,
        0.79693084])

In [227]:
y[0]


Out[227]:
1.0160004999999999

In [246]:
err = (globalPred.mean(1)-y)**2
print(err.sum()/len(err))


13.2197419163

In [232]:
err = (globalPred.mean(1)-y)**2
print(err.sum()/len(err))


15.5247409167

In [242]:
for f in modelList:
    print(str(f).split("(")[0])
    err = (f.predict(XX)-y)**2
    print(err.sum()/len(err))


SVR
35.3708083921
KNeighborsRegressor
18.9510626637
ExtraTreesRegressor
11.4237345969
RandomForestRegressor
35.7356750905
XGBRegressor
17.3771183153
GradientBoostingRegressor
18.4407805712

In [266]:
err = (XX[:,0]-y)**2
print(err.sum()/len(err))


333.703190005

In [243]:
for f in modelList:
    print(str(f).split("(")[0])
    print(f.score(XX,y))


SVR
0.889897078226
KNeighborsRegressor
0.941008773482
ExtraTreesRegressor
0.964439982747
RandomForestRegressor
0.888761314262
XGBRegressor
0.945908177237
GradientBoostingRegressor
0.942597189237

In [ ]:
XX[:10,0] # feature 0 is marshall-palmer

In [238]:
svrMeta = SVR()

In [239]:
%%time
svrMeta = svrMeta.fit(globalPred,y)


CPU times: user 766 ms, sys: 12.4 ms, total: 778 ms
Wall time: 781 ms

In [241]:
err = (svrMeta.predict(globalPred)-y)**2
print(err.sum()/len(err))


288.194411898

Here for legacy


In [41]:
in_dim = len(XX[0])
out_dim = 1  

model = Sequential()
# Dense(64) is a fully-connected layer with 64 hidden units.
# in the first layer, you must specify the expected input data shape:
# here, 20-dimensional vectors.
model.add(Dense(128, input_shape=(in_dim,)))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(1, init='uniform'))
model.add(Activation('linear'))

#sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
#model.compile(loss='mean_squared_error', optimizer=sgd)

rms = RMSprop()
model.compile(loss='mean_squared_error', optimizer=rms)

#model.fit(X_train, y_train, nb_epoch=20, batch_size=16)
#score = model.evaluate(X_test, y_test, batch_size=16)

In [42]:
prep = []
for i in y_train:
    prep.append(min(i,20))

In [43]:
prep=np.array(prep)
mi,ma = prep.min(),prep.max()
fy = (prep-mi) / (ma-mi)
#my = fy.max()
#fy = fy/fy.max()

In [44]:
model.fit(np.array(X_train), fy, batch_size=10, nb_epoch=10, validation_split=0.1)


Train on 2224 samples, validate on 248 samples
Epoch 1/10
2224/2224 [==============================] - 0s - loss: 0.0628 - val_loss: 0.0506
Epoch 2/10
2224/2224 [==============================] - 0s - loss: 0.0537 - val_loss: 0.0509
Epoch 3/10
2224/2224 [==============================] - 0s - loss: 0.0517 - val_loss: 0.0541
Epoch 4/10
2224/2224 [==============================] - 0s - loss: 0.0520 - val_loss: 0.0500
Epoch 5/10
2224/2224 [==============================] - 0s - loss: 0.0519 - val_loss: 0.0525
Epoch 6/10
2224/2224 [==============================] - 0s - loss: 0.0523 - val_loss: 0.0489
Epoch 7/10
2224/2224 [==============================] - 0s - loss: 0.0517 - val_loss: 0.0551
Epoch 8/10
2224/2224 [==============================] - 0s - loss: 0.0512 - val_loss: 0.0498
Epoch 9/10
2224/2224 [==============================] - 0s - loss: 0.0516 - val_loss: 0.0583
Epoch 10/10
2224/2224 [==============================] - 0s - loss: 0.0516 - val_loss: 0.0497
Out[44]:
<keras.callbacks.History at 0x112430c10>

In [45]:
pred = model.predict(np.array(X_test))*ma+mi

In [46]:
err = (pred-y_test)**2
err.sum()/len(err)


Out[46]:
182460.82171163053

In [ ]:
r = random.randrange(len(X_train))
print("(Train) Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_train[r]]))[0][0]*ma+mi,y_train[r]))

r = random.randrange(len(X_test))
print("(Test)  Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_test[r]]))[0][0]*ma+mi,y_test[r]))

Predict on testset


In [338]:
%%time
filename = "data/reduced_test_5000.csv"
#filename = "data/test.csv"
test = pd.read_csv(filename)
test = test.set_index('Id')


CPU times: user 10.3 ms, sys: 2.9 ms, total: 13.2 ms
Wall time: 12.4 ms

In [339]:
features_columns = list([u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])

def getX(raw):
    selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
    
    data = raw[selected_columns]
    
    docX= []
    for i in data.index.unique():
        if isinstance(data.loc[i],pd.core.series.Series):
            m = [data.loc[i].as_matrix()]
            docX.append(m)
        else:
            m = data.loc[i].as_matrix()
            docX.append(m)
    X = np.array(docX)
    return X

In [340]:
#%%time
#X=getX(test)

#tmp = []
#for i in X:
#    tmp.append(len(i))
#tmp = np.array(tmp)
#sns.countplot(tmp,order=range(tmp.min(),tmp.max()+1))
#plt.title("Number of ID per number of observations\n(On test dataset)")
#plt.plot()

In [341]:
#testFull = test.dropna()
testNoFullNan = test.loc[test[features_columns].dropna(how='all').index.unique()]

In [342]:
%%time
X=getX(testNoFullNan)  # 1min
#XX = [np.array(t).mean(0) for t in X]  # 10s


CPU times: user 107 ms, sys: 2.27 ms, total: 109 ms
Wall time: 110 ms

In [ ]:


In [343]:
XX=[]
for t in X:
    nm = np.nanmean(t,0)
    for idx,j in enumerate(nm):
        if np.isnan(j):
            nm[idx]=global_means[idx]
    XX.append(nm)
XX=np.array(XX)

# rescale to clip min at 0 (for non negative matrix factorization)
XX_rescaled=XX[:,:]-np.min(XX,0)

In [344]:
%%time
W = nmf.transform(XX_rescaled)


CPU times: user 11.7 ms, sys: 2.26 ms, total: 13.9 ms
Wall time: 13.3 ms

In [345]:
XX=addFeatures(X,mf=W)

In [346]:
pd.DataFrame(xgbr.predict(XX)).describe()


Out[346]:
0
count 270.000000
mean 39.863171
std 19.231586
min 12.168386
25% 25.052711
50% 32.528101
75% 56.126141
max 121.855644

In [348]:
reducedModelList = [knn,etreg,xgbr,gbr]
globalPred = np.array([f.predict(XX) for f in reducedModelList]).T
predTest = globalPred.mean(1)

In [100]:
predFull = zip(testNoFullNan.index.unique(),predTest)

In [101]:
testNan = test.drop(test[features_columns].dropna(how='all').index)

In [ ]:
pred = predFull + predNan

In [102]:
tmp = np.empty(len(testNan))
tmp.fill(0.445000)   # 50th percentile of full Nan dataset
predNan = zip(testNan.index.unique(),tmp)

In [103]:
testLeft = test.drop(testNan.index.unique()).drop(testFull.index.unique())

In [104]:
tmp = np.empty(len(testLeft))
tmp.fill(1.27)   # 50th percentile of full Nan dataset
predLeft = zip(testLeft.index.unique(),tmp)

In [105]:
len(testFull.index.unique())


Out[105]:
235515

In [106]:
len(testNan.index.unique())


Out[106]:
232148

In [107]:
len(testLeft.index.unique())


Out[107]:
249962

In [108]:
pred = predFull + predNan + predLeft

In [113]:
pred.sort(key=lambda x: x[0], reverse=False)

In [ ]:
#reducedModelList = [knn,etreg,xgbr,gbr]
globalPred = np.array([f.predict(XX) for f in reducedModelList]).T
#globalPred.mean(1)

In [114]:
submission = pd.DataFrame(pred)
submission.columns = ["Id","Expected"]
submission.head()


Out[114]:
Id Expected
0 1 1.270000
1 2 1.270000
2 3 2.361996
3 4 14.492731
4 5 0.445000

In [115]:
submission.loc[submission['Expected']<0,'Expected'] = 0.445

In [116]:
submission.to_csv("submit4.csv",index=False)

In [ ]:


In [73]:
filename = "data/sample_solution.csv"
sol = pd.read_csv(filename)

In [74]:
sol


Out[74]:
Id Expected
0 1 0.085765
1 2 0.000000
2 3 1.594004
3 4 6.913380
4 5 0.000000
5 6 0.173935
6 7 3.219921
7 8 0.867394
8 9 0.000000
9 10 14.182371
10 11 0.911013
11 12 0.034835
12 13 2.733501
13 14 0.709341
14 15 0.000000
15 16 0.000000
16 17 0.000000
17 18 0.393315
18 19 0.291799
19 20 0.000000
20 21 0.000000
21 22 2.031596
22 23 2.236317
23 24 0.000000
24 25 0.000000
25 26 0.009609
26 27 0.020464
27 28 0.000000
28 29 0.000000
29 30 3.074862
... ... ...
717595 717596 2.573862
717596 717597 0.000000
717597 717598 6.580513
717598 717599 0.270776
717599 717600 0.177539
717600 717601 1.133367
717601 717602 0.000000
717602 717603 9.055868
717603 717604 0.633148
717604 717605 17.524065
717605 717606 0.000000
717606 717607 0.000000
717607 717608 0.000000
717608 717609 14.180502
717609 717610 1.387969
717610 717611 0.000000
717611 717612 0.527286
717612 717613 0.000000
717613 717614 0.164476
717614 717615 2.652251
717615 717616 0.302655
717616 717617 0.183060
717617 717618 0.142695
717618 717619 0.000000
717619 717620 0.343296
717620 717621 0.064034
717621 717622 0.000000
717622 717623 1.090277
717623 717624 1.297023
717624 717625 0.000000

717625 rows × 2 columns


In [ ]:
ss = np.array(sol)

In [ ]:
%%time
for a,b in predFull:
    ss[a-1][1]=b

In [ ]:
ss

In [75]:
sub = pd.DataFrame(pred)
sub.columns = ["Id","Expected"]
sub.Id = sub.Id.astype(int)
sub.head()


Out[75]:
Id Expected
0 1 1.270000
1 2 1.270000
2 3 2.378660
3 4 8.851727
4 5 0.445000

In [76]:
sub.to_csv("submit3.csv",index=False)

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: