FDMS TME3

Florian Toque & Paul Willot

Notes

We tried different regressor model, like GBR, SVM, MLP, Random Forest and KNN as recommanded by the winning team of the Kaggle on taxi trajectories. So far GBR seems to be the best, slightly better than the RF.
The new features we exctracted only made a small impact on predictions but still improved them consistently.
We tried to use a LSTM to take advantage of the sequential structure of the data but it didn't work too well, probably because there is not enought data (13M lines divided per the average length of sequences (15), less the 30% of fully empty data)



In [321]:

    
# from __future__ import exam_success
from __future__ import absolute_import 
from __future__ import print_function  

# Standard imports
%matplotlib inline
import os
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import random
import pandas as pd
import scipy.stats as stats

# Sk cheats
from sklearn.cross_validation import cross_val_score
from sklearn import grid_search
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
#from sklearn.preprocessing import Imputer   # get rid of nan
from sklearn.decomposition import NMF        # to add features based on the latent representation
from sklearn.decomposition import ProjectedGradientNMF

# Faster gradient boosting
import xgboost as xgb

# For neural networks models
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation 
from keras.optimizers import SGD, RMSprop

13.765.202 lines in train.csv
8.022.757 lines in test.csv

Few words about the dataset

Predictions is made in the USA corn growing states (mainly Iowa, Illinois, Indiana) during the season with the highest rainfall (as illustrated by Iowa for the april to august months)

The Kaggle page indicate that the dataset have been shuffled, so working on a subset seems acceptable
The test set is not a extracted from the same data as the training set however, which make the evaluation trickier

Load the dataset



In [322]:

    
%%time
#filename = "data/train.csv"
filename = "data/reduced_train_100000.csv"
#filename = "data/reduced_train_1000000.csv"
raw = pd.read_csv(filename)
raw = raw.set_index('Id')









    



CPU times: user 245 ms, sys: 47 ms, total: 292 ms
Wall time: 296 ms



In [323]:

    
raw.columns









    Out[323]:





Index([u'minutes_past', u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th', u'Expected'],
      dtype='object')



In [324]:

    
raw['Expected'].describe()









    Out[324]:





count    100000.000000
mean        129.579825
std         687.622542
min           0.010000
25%           0.254000
50%           1.016000
75%           3.556002
max       32740.617000
Name: Expected, dtype: float64

Per wikipedia, a value of more than 421 mm/h is considered "Extreme/large hail"
If we encounter the value 327.40 meter per hour, we should probably start building Noah's ark
Therefor, it seems reasonable to drop values too large, considered as outliers



In [325]:

    
# Considering that the gauge may concentrate the rainfall, we set the cap to 1000
# Comment this line to analyse the complete dataset 
l = len(raw)
raw = raw[raw['Expected'] < 300]  #1000
print("Dropped %d (%0.2f%%)"%(l-len(raw),(l-len(raw))/float(l)*100))









    



Dropped 6241 (6.24%)



In [326]:

    
raw.head(5)









    Out[326]:






  
    
      
      minutes_past
      radardist_km
      Ref
      Ref_5x5_10th
      Ref_5x5_50th
      Ref_5x5_90th
      RefComposite
      RefComposite_5x5_10th
      RefComposite_5x5_50th
      RefComposite_5x5_90th
      ...
      RhoHV_5x5_90th
      Zdr
      Zdr_5x5_10th
      Zdr_5x5_50th
      Zdr_5x5_90th
      Kdp
      Kdp_5x5_10th
      Kdp_5x5_50th
      Kdp_5x5_90th
      Expected
    
    
      Id
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      3
      10
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      0.254
    
    
      1
      16
      10
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      0.254
    
    
      1
      25
      10
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      0.254
    
    
      1
      35
      10
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      0.254
    
    
      1
      45
      10
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      0.254
    
  

5 rows × 23 columns



In [327]:

    
raw.describe()









    Out[327]:






  
    
      
      minutes_past
      radardist_km
      Ref
      Ref_5x5_10th
      Ref_5x5_50th
      Ref_5x5_90th
      RefComposite
      RefComposite_5x5_10th
      RefComposite_5x5_50th
      RefComposite_5x5_90th
      ...
      RhoHV_5x5_90th
      Zdr
      Zdr_5x5_10th
      Zdr_5x5_50th
      Zdr_5x5_90th
      Kdp
      Kdp_5x5_10th
      Kdp_5x5_50th
      Kdp_5x5_90th
      Expected
    
  
  
    
      count
      93759.000000
      93759.000000
      44923.000000
      38391.000000
      45078.000000
      53038.000000
      47946.000000
      42219.000000
      48042.000000
      55118.000000
      ...
      42064.000000
      35938.000000
      30835.000000
      35925.000000
      42064.000000
      31231.000000
      26418.000000
      31283.000000
      36505.000000
      93759.000000
    
    
      mean
      29.684830
      11.022334
      23.684482
      20.788948
      23.378688
      26.427731
      25.424874
      22.956797
      25.139201
      27.982365
      ...
      1.014742
      0.597837
      -0.564851
      0.429577
      2.018197
      -0.013098
      -3.383198
      -0.429909
      3.855601
      5.837679
    
    
      std
      17.418876
      4.259865
      10.224306
      9.073503
      9.936862
      11.186952
      10.627954
      9.638466
      10.372406
      11.535892
      ...
      0.045336
      1.388384
      0.974288
      0.864887
      1.539513
      3.747791
      2.771442
      2.194894
      3.762005
      22.764656
    
    
      min
      0.000000
      0.000000
      -29.000000
      -31.500000
      -31.500000
      -26.500000
      -26.500000
      -27.500000
      -25.000000
      -23.000000
      ...
      0.208333
      -7.875000
      -7.875000
      -7.875000
      -7.875000
      -52.880005
      -51.420000
      -46.870010
      -41.540010
      0.010000
    
    
      25%
      15.000000
      9.000000
      17.500000
      15.500000
      17.500000
      19.500000
      19.000000
      17.500000
      18.500000
      20.500000
      ...
      0.998333
      -0.062500
      -1.000000
      0.062500
      1.125000
      -1.410004
      -4.230011
      -0.710007
      1.759994
      0.254000
    
    
      50%
      30.000000
      12.000000
      24.000000
      21.000000
      23.500000
      27.000000
      25.500000
      23.000000
      25.500000
      28.500000
      ...
      1.005000
      0.500000
      -0.500000
      0.375000
      1.687500
      0.000000
      -2.809998
      0.000000
      3.169998
      0.830000
    
    
      75%
      45.000000
      14.000000
      30.500000
      27.000000
      30.500000
      34.500000
      33.000000
      29.500000
      32.500000
      36.500000
      ...
      1.051667
      1.125000
      0.000000
      0.750000
      2.500000
      1.409988
      -1.740006
      0.349991
      5.289993
      2.794001
    
    
      max
      59.000000
      21.000000
      64.500000
      57.000000
      61.500000
      67.500000
      68.000000
      59.500000
      64.000000
      79.500000
      ...
      1.051667
      7.937500
      5.937500
      7.937500
      7.937500
      47.849990
      1.759994
      5.629990
      43.209990
      244.000120
    
  

8 rows × 23 columns

We regroup the data by ID



In [328]:

    
# We select all features except for the minutes past,
# because we ignore the time repartition of the sequence for now

features_columns = list([u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])

def getXy(raw):
    selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
    
    data = raw[selected_columns]
    
    docX, docY = [], []
    for i in data.index.unique():
        if isinstance(data.loc[i],pd.core.series.Series):
            m = [data.loc[i].as_matrix()]
            docX.append(m)
            docY.append(float(raw.loc[i]["Expected"]))
        else:
            m = data.loc[i].as_matrix()
            docX.append(m)
            docY.append(float(raw.loc[i][:1]["Expected"]))
    X , y = np.array(docX) , np.array(docY)
    return X,y

We prepare 3 subsets

Fully filled data only

Used at first to evaluate the models and no worry about features completion



In [329]:

    
#noAnyNan = raw.loc[raw[features_columns].dropna(how='any').index.unique()]
noAnyNan = raw.dropna()



In [330]:

    
noFullNan = raw.loc[raw[features_columns].dropna(how='all').index.unique()]



In [331]:

    
fullNan = raw.drop(raw[features_columns].dropna(how='all').index)



In [332]:

    
print(len(raw))
print(len(noAnyNan))
print(len(noFullNan))
print(len(fullNan))

Predicitons

As a first try, we make predictions on the complete data, and return the 50th percentile and uncomplete and fully empty data



In [333]:

    
%%time
#X,y=getXy(noAnyNan)
X,y=getXy(noFullNan)









    



CPU times: user 4.49 s, sys: 56.4 ms, total: 4.55 s
Wall time: 4.63 s



In [334]:

    
len(X[0][0])









    Out[334]:





22



In [335]:

    
#%%time
#XX = [np.array(t).mean(0) for t in X]
#XX = np.array([np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) ) for t in X])



In [336]:

    
global_means = np.nanmean(noFullNan,0)

XX=[]
for t in X:
    nm = np.nanmean(t,0)
    for idx,j in enumerate(nm):
        if np.isnan(j):
            nm[idx]=global_means[idx]
    XX.append(nm)
XX=np.array(XX)

# rescale to clip min at 0 (for non negative matrix factorization)
XX_rescaled=XX[:,:]-np.min(XX,0)



In [337]:

    
%%time
nmf = ProjectedGradientNMF()
W = nmf.fit_transform(XX_rescaled)
#H = nn.components_









    



CPU times: user 4.07 s, sys: 201 ms, total: 4.27 s
Wall time: 3.77 s



In [ ]:

%%time nmf = NMF(max_iter=1000) W = nmf.fit_transform(XX) H = nmf.components_



In [290]:

    
# used to fill fully empty datas
global_means = np.nanmean(noFullNan,0)

# reduce the sequence structure of the data and produce
# new hopefully informatives features
def addFeatures(X,mf=0):
    # used to fill fully empty datas
    #global_means = np.nanmean(X,0)
    
    XX=[]
    nbFeatures=float(len(X[0][0]))
    for idxt,t in enumerate(X):
        
        # compute means, ignoring nan when possible, marking it when fully filled with nan
        nm = np.nanmean(t,0)
        tt=[]
        for idx,j in enumerate(nm):
            if np.isnan(j):
                nm[idx]=global_means[idx]
                tt.append(1)
            else:
                tt.append(0)
        tmp = np.append(nm,np.append(tt,tt.count(0)/nbFeatures))
        
        # faster if working on fully filled data:
        #tmp = np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) )
        
        # add the percentiles
        tmp = np.append(tmp,np.nanpercentile(t,10,axis=0))
        tmp = np.append(tmp,np.nanpercentile(t,50,axis=0))
        tmp = np.append(tmp,np.nanpercentile(t,90,axis=0))
        
        for idx,i in enumerate(tmp):
            if np.isnan(i):
                tmp[idx]=0

        # adding the dbz as a feature
        test = t
        try:
            taa=test[:,0]
        except TypeError:
            taa=[test[0][0]]
        valid_time = np.zeros_like(taa)
        valid_time[0] = taa[0]
        for n in xrange(1,len(taa)):
            valid_time[n] = taa[n] - taa[n-1]
        valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
        valid_time = valid_time / 60.0


        sum=0
        try:
            column_ref=test[:,2]
        except TypeError:
            column_ref=[test[0][2]]
        for dbz, hours in zip(column_ref, valid_time):
            # See: https://en.wikipedia.org/wiki/DBZ_(meteorology)
            if np.isfinite(dbz):
                mmperhr = pow(pow(10, dbz/10)/200, 0.625)
                sum = sum + mmperhr * hours
        
        if not(mf is 0):
            tmp = np.append(tmp,mf[idxt])

        XX.append(np.append(np.array(sum),tmp))
        #XX.append(np.array([sum]))
        #XX.append(tmp)
    return np.array(XX)



In [263]:

    
%%time
XX=addFeatures(X,mf=W)
#XX=addFeatures(X)









    



CPU times: user 20.8 s, sys: 156 ms, total: 21 s
Wall time: 21.2 s



In [265]:

    
XX[:10,0]









    Out[265]:





array([ 0.63081403,  0.        ,  2.71759937,  0.04063975,  0.70164463,
        5.10341586,  0.59982168,  4.39959803,  0.24530427,  0.39020773])

(X[0][1:] - X[0][:-1]).sum(0) (t[1:] - t[:-1]).sum(0) %%time etreg = etreg.fit(XX,y)



In [121]:

    
def splitTrainTest(X, y, split=0.2):
    tmp1, tmp2 = [], []
    ps = int(len(X) * (1-split))
    index_shuf = range(len(X))
    random.shuffle(index_shuf)
    for i in index_shuf:
        tmp1.append(X[i])
        tmp2.append(y[i])
    return tmp1[:ps], tmp2[:ps], tmp1[ps:], tmp2[ps:]



In [122]:

    
X_train,y_train, X_test, y_test = splitTrainTest(XX,y)



In [123]:

    
def manualScorer(estimator, X, y):
    err = (estimator.predict(X_test)-y_test)**2
    return -err.sum()/len(err)

max prof 24 nb trees 84 min sample per leaf 17 min sample to split 51



In [185]:

    
svr = SVR(kernel='rbf', C=800.0)

parameters = {'C':range(500,900,50)} grid_svr = grid_search.GridSearchCV(svr, parameters,scoring=manualScorer) grid_svr.fit(X_train,y_train) print(grid_svr.grid_scores_) print("Best: ",grid_svr.best_params_)



In [186]:

    
%%time
srv = svr.fit(X_train,y_train)









    



CPU times: user 7.9 s, sys: 85.4 ms, total: 7.99 s
Wall time: 8.08 s



In [184]:

    
print(svr.score(X_train,y_train))
print(svr.score(X_test,y_test))









    



0.984176859055
0.40960167131



In [131]:

    
err = (svr.predict(X_train)-y_train)**2
err.sum()/len(err)









    Out[131]:





330.21588147846546



In [132]:

    
err = (svr.predict(X_test)-y_test)**2
err.sum()/len(err)









    Out[132]:





284.10441792226254



In [134]:

    
%%time
svr_score = cross_val_score(svr, XX, y, cv=5)
print("Score: %s\nMean: %.03f"%(svr_score,svr_score.mean()))









    



Score: [-0.02459267 -0.01029681 -0.00540083 -0.00335848 -0.01941362]
Mean: -0.013
CPU times: user 25.3 s, sys: 173 ms, total: 25.5 s
Wall time: 25.5 s



In [135]:

    
knn = KNeighborsRegressor(n_neighbors=6,weights='distance',algorithm='ball_tree')



In [29]:

    
#parameters = {'weights':('distance','uniform'),'algorithm':('auto', 'ball_tree', 'kd_tree', 'brute')}
parameters = {'n_neighbors':range(1,10,1)}
grid_knn = grid_search.GridSearchCV(knn, parameters,scoring=manualScorer)



In [30]:

    
%%time
grid_knn.fit(X_train,y_train)









    



CPU times: user 2.51 s, sys: 0 ns, total: 2.51 s
Wall time: 2.51 s






    Out[30]:





GridSearchCV(cv=None,
       estimator=KNeighborsRegressor(algorithm='ball_tree', leaf_size=30, metric='minkowski',
          metric_params=None, n_neighbors=6, p=2, weights='distance'),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring=<function manualScorer at 0x7fcc54db8050>, verbose=0)



In [31]:

    
print(grid_knn.grid_scores_)
print("Best: ",grid_knn.best_params_)









    



[mean: nan, std: nan, params: {'n_neighbors': 1}, mean: nan, std: nan, params: {'n_neighbors': 2}, mean: nan, std: nan, params: {'n_neighbors': 3}, mean: nan, std: nan, params: {'n_neighbors': 4}, mean: nan, std: nan, params: {'n_neighbors': 5}, mean: nan, std: nan, params: {'n_neighbors': 6}, mean: nan, std: nan, params: {'n_neighbors': 7}, mean: nan, std: nan, params: {'n_neighbors': 8}, mean: nan, std: nan, params: {'n_neighbors': 9}]
Best:  {'n_neighbors': 1}



In [32]:

    
knn = grid_knn.best_estimator_



In [136]:

    
knn= knn.fit(X_train,y_train)



In [137]:

    
print(knn.score(X_train,y_train))
print(knn.score(X_test,y_test))









    



1.0
0.66232912873



In [72]:

    
err = (knn.predict(X_train)-y_train)**2
err.sum()/len(err)









    Out[72]:





1.2703084913435574



In [73]:

    
err = (knn.predict(X_test)-y_test)**2
err.sum()/len(err)









    Out[73]:





85.489427551154463



In [138]:

    
etreg = ExtraTreesRegressor(n_estimators=200, max_depth=None, min_samples_split=1, random_state=0)



In [55]:

    
parameters = {'n_estimators':range(100,200,20)}
grid_rf = grid_search.GridSearchCV(etreg, parameters,n_jobs=2,scoring=manualScorer)



In [37]:

    
%%time
grid_rf.fit(X_train,y_train)









    



CPU times: user 4.44 s, sys: 241 ms, total: 4.68 s
Wall time: 16.3 s






    Out[37]:





GridSearchCV(cv=None, error_score='raise',
       estimator=ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
          min_samples_split=1, min_weight_fraction_leaf=0.0,
          n_estimators=200, n_jobs=1, oob_score=False, random_state=0,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=2,
       param_grid={'n_estimators': [100, 120, 140, 160, 180]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring=<function manualScorer at 0x110aa0ed8>, verbose=0)



In [38]:

    
print(grid_rf.grid_scores_)
print("Best: ",grid_rf.best_params_)









    



[mean: -55.73522, std: 40.35044, params: {'n_estimators': 100}, mean: -55.47051, std: 39.85010, params: {'n_estimators': 120}, mean: -56.18434, std: 40.62698, params: {'n_estimators': 140}, mean: -56.15046, std: 40.74838, params: {'n_estimators': 160}, mean: -56.37052, std: 40.72395, params: {'n_estimators': 180}]
Best:  {'n_estimators': 120}



In [39]:

    
grid_rf.best_params_









    Out[39]:





{'n_estimators': 120}



In [135]:

    
#etreg = grid_rf.best_estimator_



In [139]:

    
%%time
etreg = etreg.fit(X_train,y_train)









    



CPU times: user 26.3 s, sys: 123 ms, total: 26.5 s
Wall time: 26.5 s



In [140]:

    
print(etreg.score(X_train,y_train))
print(etreg.score(X_test,y_test))









    



1.0
0.796451392571



In [77]:

    
err = (etreg.predict(X_train)-y_train)**2
err.sum()/len(err)









    Out[77]:





1.2703084913435574



In [78]:

    
err = (etreg.predict(X_test)-y_test)**2
err.sum()/len(err)









    Out[78]:





87.171223669446121



In [158]:

    
rfr = RandomForestRegressor(n_estimators=200, criterion='mse', max_depth=None, min_samples_split=2,
                            min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
                            max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=-1,
                            random_state=None, verbose=0, warm_start=False)



In [159]:

    
%%time
rfr = rfr.fit(X_train,y_train)









    



CPU times: user 2min 22s, sys: 420 ms, total: 2min 22s
Wall time: 39.5 s



In [160]:

    
print(rfr.score(X_train,y_train))
print(rfr.score(X_test,y_test))









    



0.945126763577
0.62248085166



In [141]:

    
# the dbz feature does not influence xgbr so much
xgbr = xgb.XGBRegressor(max_depth=6, learning_rate=0.1, n_estimators=700, silent=True,
                        objective='reg:linear', nthread=-1, gamma=0, min_child_weight=1,
                        max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
                        reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5,
                        seed=0, missing=None)



In [142]:

    
%%time
xgbr = xgbr.fit(X_train,y_train)









    



CPU times: user 1min 3s, sys: 1.75 s, total: 1min 5s
Wall time: 20.4 s



In [119]:

    
# without the nmf features
# print(xgbr.score(X_train,y_train))
## 0.993948231144
# print(xgbr.score(X_test,y_test))
## 0.613931733332









    



0.993948231144
0.613931733332



In [143]:

    
# with nmf features
print(xgbr.score(X_train,y_train))
print(xgbr.score(X_test,y_test))









    



0.999679564454
0.691887473333



In [144]:

    
gbr = GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=900,
                                subsample=1.0, min_samples_split=2, min_samples_leaf=1, max_depth=4, init=None,
                                random_state=None, max_features=None, alpha=0.5,
                                verbose=0, max_leaf_nodes=None, warm_start=False)



In [145]:

    
%%time
gbr = gbr.fit(X_train,y_train)
#os.system('say "終わりだ"') #its over!









    



CPU times: user 52 s, sys: 245 ms, total: 52.3 s
Wall time: 52.5 s



In [82]:

    
#parameters = {'max_depth':range(2,5,1),'alpha':[0.5,0.6,0.7,0.8,0.9]}
#parameters = {'subsample':[0.2,0.4,0.5,0.6,0.8,1]}
#parameters = {'subsample':[0.2,0.5,0.6,0.8,1],'n_estimators':[800,1000,1200]}
#parameters = {'max_depth':range(2,4,1)}
parameters = {'n_estimators':[400,800,1100]}
#parameters = {'loss':['ls', 'lad', 'huber', 'quantile'],'alpha':[0.3,0.5,0.8,0.9]}
#parameters = {'learning_rate':[0.1,0.5,0.9]}



grid_gbr = grid_search.GridSearchCV(gbr, parameters,n_jobs=2,scoring=manualScorer)



In [83]:

    
%%time
grid_gbr = grid_gbr.fit(X_train,y_train)









    



CPU times: user 38.1 s, sys: 566 ms, total: 38.7 s
Wall time: 2min 40s



In [84]:

    
print(grid_gbr.grid_scores_)
print("Best: ",grid_gbr.best_params_)









    



[mean: -132.42728, std: 13.82615, params: {'n_estimators': 400}, mean: -129.63912, std: 18.26752, params: {'n_estimators': 800}, mean: -131.56520, std: 14.32391, params: {'n_estimators': 1100}]
Best:  {'n_estimators': 800}



In [146]:

    
print(gbr.score(X_train,y_train))
print(gbr.score(X_test,y_test))









    



0.997514944924
0.683160735998



In [81]:

    
err = (gbr.predict(X_train)-y_train)**2
print(err.sum()/len(err))
err = (gbr.predict(X_test)-y_test)**2
print(err.sum()/len(err))









    



2.7346018758
92.3521455159



In [46]:

    
err = (gbr.predict(X_train)-y_train)**2
print(err.sum()/len(err))
err = (gbr.predict(X_test)-y_test)**2
print(err.sum()/len(err))









    



17.0046462288
1578.69166275



In [57]:

    
t = []
for i in XX:
    t.append(np.count_nonzero(~np.isnan(i)) / float(i.size))
pd.DataFrame(np.array(t)).describe()



In [215]:

    
svr.predict(X_test)









    Out[215]:





array([  0.35491106,   2.20000364,   4.16269146, ...,   0.3539263 ,
         3.95877051,  14.5789092 ])



In [202]:

    
s = modelList[0]



In [229]:

    
t.mean(1)









    Out[229]:





array([ 1.,  2.,  3.])



In [244]:

    
modelList = [svr,knn,etreg,rfr,xgbr,gbr]
reducedModelList = [knn,etreg,xgbr,gbr]



In [214]:

    
score_train = [[str(f).split("(")[0],f.score(X_train,y_train)] for f in modelList]
score_test = [[str(f).split("(")[0],f.score(X_test,y_test)] for f in modelList]
for idx,i in enumerate(score_train):
    print(i[0])
    print("   train: %.03f"%i[1])
    print("   test:  %.03f"%score_test[idx][1])









    



SVR
   train: 0.943
   test:  0.638
KNeighborsRegressor
   train: 1.000
   test:  0.662
ExtraTreesRegressor
   train: 1.000
   test:  0.796
RandomForestRegressor
   train: 0.945
   test:  0.622
XGBRegressor
   train: 1.000
   test:  0.692
GradientBoostingRegressor
   train: 0.998
   test:  0.683



In [245]:

    
#reducedModelList = [knn,etreg,xgbr,gbr]
globalPred = np.array([f.predict(XX) for f in reducedModelList]).T
#globalPred.mean(1)



In [226]:

    
globalPred[0]









    Out[226]:





array([ 0.91597311,  1.0160005 ,  1.0160005 ,  1.26184066,  1.03321815,
        0.79693084])



In [227]:

    
y[0]









    Out[227]:





1.0160004999999999



In [246]:

    
err = (globalPred.mean(1)-y)**2
print(err.sum()/len(err))









    



13.2197419163



In [232]:

    
err = (globalPred.mean(1)-y)**2
print(err.sum()/len(err))









    



15.5247409167



In [242]:

    
for f in modelList:
    print(str(f).split("(")[0])
    err = (f.predict(XX)-y)**2
    print(err.sum()/len(err))









    



SVR
35.3708083921
KNeighborsRegressor
18.9510626637
ExtraTreesRegressor
11.4237345969
RandomForestRegressor
35.7356750905
XGBRegressor
17.3771183153
GradientBoostingRegressor
18.4407805712



In [266]:

    
err = (XX[:,0]-y)**2
print(err.sum()/len(err))









    



333.703190005



In [243]:

    
for f in modelList:
    print(str(f).split("(")[0])
    print(f.score(XX,y))









    



SVR
0.889897078226
KNeighborsRegressor
0.941008773482
ExtraTreesRegressor
0.964439982747
RandomForestRegressor
0.888761314262
XGBRegressor
0.945908177237
GradientBoostingRegressor
0.942597189237



In [ ]:

    
XX[:10,0] # feature 0 is marshall-palmer



In [238]:

    
svrMeta = SVR()



In [239]:

    
%%time
svrMeta = svrMeta.fit(globalPred,y)









    



CPU times: user 766 ms, sys: 12.4 ms, total: 778 ms
Wall time: 781 ms



In [241]:

    
err = (svrMeta.predict(globalPred)-y)**2
print(err.sum()/len(err))









    



288.194411898

Here for legacy



In [41]:

    
in_dim = len(XX[0])
out_dim = 1  

model = Sequential()
# Dense(64) is a fully-connected layer with 64 hidden units.
# in the first layer, you must specify the expected input data shape:
# here, 20-dimensional vectors.
model.add(Dense(128, input_shape=(in_dim,)))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(1, init='uniform'))
model.add(Activation('linear'))

#sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
#model.compile(loss='mean_squared_error', optimizer=sgd)

rms = RMSprop()
model.compile(loss='mean_squared_error', optimizer=rms)

#model.fit(X_train, y_train, nb_epoch=20, batch_size=16)
#score = model.evaluate(X_test, y_test, batch_size=16)



In [42]:

    
prep = []
for i in y_train:
    prep.append(min(i,20))



In [43]:

    
prep=np.array(prep)
mi,ma = prep.min(),prep.max()
fy = (prep-mi) / (ma-mi)
#my = fy.max()
#fy = fy/fy.max()



In [44]:

    
model.fit(np.array(X_train), fy, batch_size=10, nb_epoch=10, validation_split=0.1)









    



Train on 2224 samples, validate on 248 samples
Epoch 1/10
2224/2224 [==============================] - 0s - loss: 0.0628 - val_loss: 0.0506
Epoch 2/10
2224/2224 [==============================] - 0s - loss: 0.0537 - val_loss: 0.0509
Epoch 3/10
2224/2224 [==============================] - 0s - loss: 0.0517 - val_loss: 0.0541
Epoch 4/10
2224/2224 [==============================] - 0s - loss: 0.0520 - val_loss: 0.0500
Epoch 5/10
2224/2224 [==============================] - 0s - loss: 0.0519 - val_loss: 0.0525
Epoch 6/10
2224/2224 [==============================] - 0s - loss: 0.0523 - val_loss: 0.0489
Epoch 7/10
2224/2224 [==============================] - 0s - loss: 0.0517 - val_loss: 0.0551
Epoch 8/10
2224/2224 [==============================] - 0s - loss: 0.0512 - val_loss: 0.0498
Epoch 9/10
2224/2224 [==============================] - 0s - loss: 0.0516 - val_loss: 0.0583
Epoch 10/10
2224/2224 [==============================] - 0s - loss: 0.0516 - val_loss: 0.0497






    Out[44]:





<keras.callbacks.History at 0x112430c10>



In [45]:

    
pred = model.predict(np.array(X_test))*ma+mi



In [46]:

    
err = (pred-y_test)**2
err.sum()/len(err)









    Out[46]:





182460.82171163053



In [ ]:

    
r = random.randrange(len(X_train))
print("(Train) Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_train[r]]))[0][0]*ma+mi,y_train[r]))

r = random.randrange(len(X_test))
print("(Test)  Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_test[r]]))[0][0]*ma+mi,y_test[r]))

Predict on testset



In [338]:

    
%%time
filename = "data/reduced_test_5000.csv"
#filename = "data/test.csv"
test = pd.read_csv(filename)
test = test.set_index('Id')









    



CPU times: user 10.3 ms, sys: 2.9 ms, total: 13.2 ms
Wall time: 12.4 ms



In [339]:

    
features_columns = list([u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])

def getX(raw):
    selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
    
    data = raw[selected_columns]
    
    docX= []
    for i in data.index.unique():
        if isinstance(data.loc[i],pd.core.series.Series):
            m = [data.loc[i].as_matrix()]
            docX.append(m)
        else:
            m = data.loc[i].as_matrix()
            docX.append(m)
    X = np.array(docX)
    return X



In [340]:

    
#%%time
#X=getX(test)

#tmp = []
#for i in X:
#    tmp.append(len(i))
#tmp = np.array(tmp)
#sns.countplot(tmp,order=range(tmp.min(),tmp.max()+1))
#plt.title("Number of ID per number of observations\n(On test dataset)")
#plt.plot()



In [341]:

    
#testFull = test.dropna()
testNoFullNan = test.loc[test[features_columns].dropna(how='all').index.unique()]



In [342]:

    
%%time
X=getX(testNoFullNan)  # 1min
#XX = [np.array(t).mean(0) for t in X]  # 10s









    



CPU times: user 107 ms, sys: 2.27 ms, total: 109 ms
Wall time: 110 ms



In [ ]:



In [343]:

    
XX=[]
for t in X:
    nm = np.nanmean(t,0)
    for idx,j in enumerate(nm):
        if np.isnan(j):
            nm[idx]=global_means[idx]
    XX.append(nm)
XX=np.array(XX)

# rescale to clip min at 0 (for non negative matrix factorization)
XX_rescaled=XX[:,:]-np.min(XX,0)



In [344]:

    
%%time
W = nmf.transform(XX_rescaled)









    



CPU times: user 11.7 ms, sys: 2.26 ms, total: 13.9 ms
Wall time: 13.3 ms



In [345]:

    
XX=addFeatures(X,mf=W)



In [346]:

    
pd.DataFrame(xgbr.predict(XX)).describe()



In [348]:

    
reducedModelList = [knn,etreg,xgbr,gbr]
globalPred = np.array([f.predict(XX) for f in reducedModelList]).T
predTest = globalPred.mean(1)



In [100]:

    
predFull = zip(testNoFullNan.index.unique(),predTest)



In [101]:

    
testNan = test.drop(test[features_columns].dropna(how='all').index)



In [ ]:

    
pred = predFull + predNan



In [102]:

    
tmp = np.empty(len(testNan))
tmp.fill(0.445000)   # 50th percentile of full Nan dataset
predNan = zip(testNan.index.unique(),tmp)



In [103]:

    
testLeft = test.drop(testNan.index.unique()).drop(testFull.index.unique())



In [104]:

    
tmp = np.empty(len(testLeft))
tmp.fill(1.27)   # 50th percentile of full Nan dataset
predLeft = zip(testLeft.index.unique(),tmp)



In [105]:

    
len(testFull.index.unique())









    Out[105]:





235515



In [106]:

    
len(testNan.index.unique())









    Out[106]:





232148



In [107]:

    
len(testLeft.index.unique())









    Out[107]:





249962



In [108]:

    
pred = predFull + predNan + predLeft



In [113]:

    
pred.sort(key=lambda x: x[0], reverse=False)



In [ ]:

    
#reducedModelList = [knn,etreg,xgbr,gbr]
globalPred = np.array([f.predict(XX) for f in reducedModelList]).T
#globalPred.mean(1)



In [114]:

    
submission = pd.DataFrame(pred)
submission.columns = ["Id","Expected"]
submission.head()



In [115]:

    
submission.loc[submission['Expected']<0,'Expected'] = 0.445



In [116]:

    
submission.to_csv("submit4.csv",index=False)



In [ ]:



In [73]:

    
filename = "data/sample_solution.csv"
sol = pd.read_csv(filename)



In [74]:

    
sol









    Out[74]:






  
    
      
      Id
      Expected
    
  
  
    
      0
      1
      0.085765
    
    
      1
      2
      0.000000
    
    
      2
      3
      1.594004
    
    
      3
      4
      6.913380
    
    
      4
      5
      0.000000
    
    
      5
      6
      0.173935
    
    
      6
      7
      3.219921
    
    
      7
      8
      0.867394
    
    
      8
      9
      0.000000
    
    
      9
      10
      14.182371
    
    
      10
      11
      0.911013
    
    
      11
      12
      0.034835
    
    
      12
      13
      2.733501
    
    
      13
      14
      0.709341
    
    
      14
      15
      0.000000
    
    
      15
      16
      0.000000
    
    
      16
      17
      0.000000
    
    
      17
      18
      0.393315
    
    
      18
      19
      0.291799
    
    
      19
      20
      0.000000
    
    
      20
      21
      0.000000
    
    
      21
      22
      2.031596
    
    
      22
      23
      2.236317
    
    
      23
      24
      0.000000
    
    
      24
      25
      0.000000
    
    
      25
      26
      0.009609
    
    
      26
      27
      0.020464
    
    
      27
      28
      0.000000
    
    
      28
      29
      0.000000
    
    
      29
      30
      3.074862
    
    
      ...
      ...
      ...
    
    
      717595
      717596
      2.573862
    
    
      717596
      717597
      0.000000
    
    
      717597
      717598
      6.580513
    
    
      717598
      717599
      0.270776
    
    
      717599
      717600
      0.177539
    
    
      717600
      717601
      1.133367
    
    
      717601
      717602
      0.000000
    
    
      717602
      717603
      9.055868
    
    
      717603
      717604
      0.633148
    
    
      717604
      717605
      17.524065
    
    
      717605
      717606
      0.000000
    
    
      717606
      717607
      0.000000
    
    
      717607
      717608
      0.000000
    
    
      717608
      717609
      14.180502
    
    
      717609
      717610
      1.387969
    
    
      717610
      717611
      0.000000
    
    
      717611
      717612
      0.527286
    
    
      717612
      717613
      0.000000
    
    
      717613
      717614
      0.164476
    
    
      717614
      717615
      2.652251
    
    
      717615
      717616
      0.302655
    
    
      717616
      717617
      0.183060
    
    
      717617
      717618
      0.142695
    
    
      717618
      717619
      0.000000
    
    
      717619
      717620
      0.343296
    
    
      717620
      717621
      0.064034
    
    
      717621
      717622
      0.000000
    
    
      717622
      717623
      1.090277
    
    
      717623
      717624
      1.297023
    
    
      717624
      717625
      0.000000
    
  

717625 rows × 2 columns



In [ ]:

    
ss = np.array(sol)



In [ ]:

    
%%time
for a,b in predFull:
    ss[a-1][1]=b



In [ ]:

    
ss



In [75]:

    
sub = pd.DataFrame(pred)
sub.columns = ["Id","Expected"]
sub.Id = sub.Id.astype(int)
sub.head()



In [76]:

    
sub.to_csv("submit3.csv",index=False)



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:

	Id	Expected
0	1	1.270000
1	2	1.270000
2	3	2.361996
3	4	14.492731
4	5	0.445000

	Id	Expected
0	1	1.270000
1	2	1.270000
2	3	2.378660
3	4	8.851727
4	5	0.445000

	minutes_past	radardist_km	Ref	Ref_5x5_10th	Ref_5x5_50th	Ref_5x5_90th	RefComposite	RefComposite_5x5_10th	RefComposite_5x5_50th	RefComposite_5x5_90th	...	RhoHV_5x5_90th	Zdr	Zdr_5x5_10th	Zdr_5x5_50th	Zdr_5x5_90th	Kdp	Kdp_5x5_10th	Kdp_5x5_50th	Kdp_5x5_90th	Expected
Id
1	3	10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.254
1	16	10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.254
1	25	10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.254
1	35	10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.254
1	45	10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.254

	minutes_past	radardist_km	Ref	Ref_5x5_10th	Ref_5x5_50th	Ref_5x5_90th	RefComposite	RefComposite_5x5_10th	RefComposite_5x5_50th	RefComposite_5x5_90th	...	RhoHV_5x5_90th	Zdr	Zdr_5x5_10th	Zdr_5x5_50th	Zdr_5x5_90th	Kdp	Kdp_5x5_10th	Kdp_5x5_50th	Kdp_5x5_90th	Expected
count	93759.000000	93759.000000	44923.000000	38391.000000	45078.000000	53038.000000	47946.000000	42219.000000	48042.000000	55118.000000	...	42064.000000	35938.000000	30835.000000	35925.000000	42064.000000	31231.000000	26418.000000	31283.000000	36505.000000	93759.000000
mean	29.684830	11.022334	23.684482	20.788948	23.378688	26.427731	25.424874	22.956797	25.139201	27.982365	...	1.014742	0.597837	-0.564851	0.429577	2.018197	-0.013098	-3.383198	-0.429909	3.855601	5.837679
std	17.418876	4.259865	10.224306	9.073503	9.936862	11.186952	10.627954	9.638466	10.372406	11.535892	...	0.045336	1.388384	0.974288	0.864887	1.539513	3.747791	2.771442	2.194894	3.762005	22.764656
min	0.000000	0.000000	-29.000000	-31.500000	-31.500000	-26.500000	-26.500000	-27.500000	-25.000000	-23.000000	...	0.208333	-7.875000	-7.875000	-7.875000	-7.875000	-52.880005	-51.420000	-46.870010	-41.540010	0.010000
25%	15.000000	9.000000	17.500000	15.500000	17.500000	19.500000	19.000000	17.500000	18.500000	20.500000	...	0.998333	-0.062500	-1.000000	0.062500	1.125000	-1.410004	-4.230011	-0.710007	1.759994	0.254000
50%	30.000000	12.000000	24.000000	21.000000	23.500000	27.000000	25.500000	23.000000	25.500000	28.500000	...	1.005000	0.500000	-0.500000	0.375000	1.687500	0.000000	-2.809998	0.000000	3.169998	0.830000
75%	45.000000	14.000000	30.500000	27.000000	30.500000	34.500000	33.000000	29.500000	32.500000	36.500000	...	1.051667	1.125000	0.000000	0.750000	2.500000	1.409988	-1.740006	0.349991	5.289993	2.794001
max	59.000000	21.000000	64.500000	57.000000	61.500000	67.500000	68.000000	59.500000	64.000000	79.500000	...	1.051667	7.937500	5.937500	7.937500	7.937500	47.849990	1.759994	5.629990	43.209990	244.000120

	0
count	270.000000
mean	39.863171
std	19.231586
min	12.168386
25%	25.052711
50%	32.528101
75%	56.126141
max	121.855644

	Id	Expected
0	1	0.085765
1	2	0.000000
2	3	1.594004
3	4	6.913380
4	5	0.000000
5	6	0.173935
6	7	3.219921
7	8	0.867394
8	9	0.000000
9	10	14.182371
10	11	0.911013
11	12	0.034835
12	13	2.733501
13	14	0.709341
14	15	0.000000
15	16	0.000000
16	17	0.000000
17	18	0.393315
18	19	0.291799
19	20	0.000000
20	21	0.000000
21	22	2.031596
22	23	2.236317
23	24	0.000000
24	25	0.000000
25	26	0.009609
26	27	0.020464
27	28	0.000000
28	29	0.000000
29	30	3.074862
...	...	...
717595	717596	2.573862
717596	717597	0.000000
717597	717598	6.580513
717598	717599	0.270776
717599	717600	0.177539
717600	717601	1.133367
717601	717602	0.000000
717602	717603	9.055868
717603	717604	0.633148
717604	717605	17.524065
717605	717606	0.000000
717606	717607	0.000000
717607	717608	0.000000
717608	717609	14.180502
717609	717610	1.387969
717610	717611	0.000000
717611	717612	0.527286
717612	717613	0.000000
717613	717614	0.164476
717614	717615	2.652251
717615	717616	0.302655
717616	717617	0.183060
717617	717618	0.142695
717618	717619	0.000000
717619	717620	0.343296
717620	717621	0.064034
717621	717622	0.000000
717622	717623	1.090277
717623	717624	1.297023
717624	717625	0.000000