In [97]:
# from __future__ import exam_success
from __future__ import absolute_import
from __future__ import print_function
%matplotlib inline
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import random
import pandas as pd
import scipy.stats as stats
# scikit-learn models and utilities
from sklearn.cross_validation import cross_val_score  # cross-validation (pre-0.20 sklearn API)
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import Imputer  # imputation of NaN values
from sklearn.neighbors import KNeighborsRegressor
from sklearn import grid_search  # GridSearchCV (pre-0.20 sklearn API)
import os
Predictions are made for the US corn-growing states (mainly Iowa, Illinois and Indiana) during the season with the highest rainfall (as illustrated by Iowa for the April to August months).
The Kaggle page indicates that the dataset has been shuffled, so working on a subset seems acceptable.
However, the test set is not extracted from the same data as the training set, which makes the evaluation trickier.
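For reference, a reduced file such as reduced_train_10000.csv can be built by reading the first rows of train.csv and dropping the last, possibly truncated, Id. This is only a minimal sketch; how the reduced files used below were actually generated is not documented here.
import pandas as pd

# Take the first 10,000 rows of the full training file (path assumed).
subset = pd.read_csv("data/train.csv", nrows=10000)

# The last Id may be cut mid-sequence; keep only complete Ids.
subset = subset[subset["Id"] != subset["Id"].iloc[-1]]
subset.to_csv("data/reduced_train_10000.csv", index=False)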
In [111]:
%%time
#filename = "data/train.csv"
filename = "data/reduced_train_10000.csv"
#filename = "data/reduced_train_1000000.csv"
raw = pd.read_csv(filename)
raw = raw.set_index('Id')
In [112]:
raw.columns
Out[112]:
In [110]:
testFull.columns
Out[110]:
In [3]:
raw['Expected'].describe()
Out[3]:
Per Wikipedia, a rain rate of more than 421 mm/h is considered "Extreme/large hail".
If we encounter a value of 327.40 metres per hour, we should probably start building Noah's ark.
Therefore, it seems reasonable to drop excessively large values and treat them as outliers.
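To pick the cutoff, it helps to look at the upper quantiles of 'Expected' first (a quick check on the loaded subset, not part of the original pipeline):
# Upper quantiles of the gauge readings: where the distribution stops looking
# like rain and starts looking like sensor error.
print(raw['Expected'].quantile([0.50, 0.90, 0.99, 0.999]))
print("Max: %.2f mm" % raw['Expected'].max())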
In [87]:
# Considering that the gauge may concentrate the rainfall, we cap 'Expected' at 300 mm
# (1000 is a looser alternative). Comment the filter out to analyse the complete dataset.
l = len(raw)
raw = raw[raw['Expected'] < 300]  # or < 1000
print("Dropped %d (%0.2f%%)"%(l-len(raw),(l-len(raw))/float(l)*100))
In [5]:
raw.head(5)
Out[5]:
In [6]:
raw.describe()
Out[6]:
We group the data by Id, building one observation sequence and one target per Id (the isinstance check below handles Ids with a single observation; see the toy example after the cell).
In [7]:
# We select all features except minutes_past,
# because we ignore the temporal structure of the sequence for now
features_columns = list([u'Ref', u'Ref_5x5_10th',
u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
def getXy(raw):
selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
data = raw[selected_columns]
docX, docY = [], []
for i in data.index.unique():
if isinstance(data.loc[i],pd.core.series.Series):
m = [data.loc[i].as_matrix()]
docX.append(m)
docY.append(float(raw.loc[i]["Expected"]))
else:
m = data.loc[i].as_matrix()
docX.append(m)
docY.append(float(raw.loc[i][:1]["Expected"]))
X , y = np.array(docX) , np.array(docY)
return X,y
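The isinstance check above handles Ids with a single radar observation, where data.loc[i] returns a Series rather than a DataFrame. A toy illustration (hypothetical data, only to show the two cases):
import pandas as pd

toy = pd.DataFrame({'Ref': [10.0, 12.0, 15.0]}, index=[1, 1, 2])
print(type(toy.loc[1]))  # DataFrame: Id 1 has two observations
print(type(toy.loc[2]))  # Series: Id 2 has a single observation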
In [88]:
#noAnyNan = raw.loc[raw[features_columns].dropna(how='any').index.unique()]
noAnyNan = raw.dropna()
In [89]:
noFullNan = raw.loc[raw[features_columns].dropna(how='all').index.unique()]
In [90]:
fullNan = raw.drop(raw[features_columns].dropna(how='all').index)
As a first try, we train on the complete rows only, and fall back to the 50th percentile for Ids whose radar features are partially or fully missing.
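The intended routing can be summed up as follows. This is a sketch only; the fallback constants match the values used in the submission section below.
import numpy as np

def route_prediction(features, model, median_all_nan=0.445, median_partial_nan=1.27):
    features = np.asarray(features, dtype=float)
    if not np.isnan(features).any():    # complete observation -> model prediction
        return model.predict([features])[0]
    if np.isnan(features).all():        # nothing measured -> fully-missing fallback
        return median_all_nan
    return median_partial_nan           # partially measured -> second fallback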
In [91]:
%%time
X,y=getXy(noAnyNan)
In [12]:
%%time
#XX = [np.array(t).mean(0) for t in X]
XX = [np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) ) for t in X]
In [92]:
%%time
XX = []
for t in X:
    arr = np.array(t)
    # Per-feature mean and summed first differences over the sequence
    tmp = np.append(np.nanmean(arr, 0), (arr[1:] - arr[:-1]).sum(0))
    # Per-feature 10th/50th/90th percentiles
    tmp = np.append(tmp, np.percentile(t, 10, axis=0))
    tmp = np.append(tmp, np.percentile(t, 50, axis=0))
    tmp = np.append(tmp, np.percentile(t, 90, axis=0))
    seq = t
    try:
        taa = seq[:, 0]            # minutes_past column
    except TypeError:
        taa = [seq[0][0]]          # single-observation Id: seq is a plain list
    # How long each observation is valid for, converted to hours
    valid_time = np.zeros_like(taa)
    valid_time[0] = taa[0]
    for n in xrange(1, len(taa)):
        valid_time[n] = taa[n] - taa[n-1]
    valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
    valid_time = valid_time / 60.0
    rain_sum = 0
    try:
        column_ref = seq[:, 2]     # Ref column
    except TypeError:
        column_ref = [seq[0][2]]
    for dbz, hours in zip(column_ref, valid_time):
        # Marshall-Palmer estimate, see: https://en.wikipedia.org/wiki/DBZ_(meteorology)
        if np.isfinite(dbz):
            mmperhr = pow(pow(10, dbz/10)/200, 0.625)
            rain_sum = rain_sum + mmperhr * hours
    XX.append(np.append(np.array(rain_sum), tmp))
    #XX.append(np.array([rain_sum]))   # alternative: Marshall-Palmer sum only
    #XX.append(tmp)                    # alternative: summary statistics only
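The dBZ to rain-rate conversion used above is the Marshall-Palmer relation Z = 200 R^1.6, inverted as R = (10^(dBZ/10) / 200)^0.625 with R in mm/h. A standalone sanity check of the conversion (values purely illustrative):
def marshall_palmer_rate(dbz):
    # Rain rate in mm/h from reflectivity in dBZ, assuming Z = 200 * R**1.6
    z = 10.0 ** (dbz / 10.0)       # dBZ -> linear reflectivity Z
    return (z / 200.0) ** 0.625    # invert Z = 200 * R**1.6

for dbz in (20, 30, 40, 50):
    print("%d dBZ -> %.2f mm/h" % (dbz, marshall_palmer_rate(dbz)))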
In [14]:
XX[2]
Out[14]:
In [93]:
def splitTrainTest(X, y, split=0.2):
tmp1, tmp2 = [], []
ps = int(len(X) * (1-split))
index_shuf = range(len(X))
random.shuffle(index_shuf)
for i in index_shuf:
tmp1.append(X[i])
tmp2.append(y[i])
return tmp1[:ps], tmp2[:ps], tmp1[ps:], tmp2[ps:]
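For reference, the same shuffled split can be obtained with scikit-learn's built-in helper (still under sklearn.cross_validation in the pre-0.20 API used here). Shown only as an equivalent; the cells below keep using splitTrainTest.
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in 0.18+

X_tr, X_te, y_tr, y_te = train_test_split(XX, y, test_size=0.2, random_state=0)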
In [94]:
X_train,y_train, X_test, y_test = splitTrainTest(XX,y)
In [22]:
def manualScorer(estimator, X, y):
    # Negative mean squared error (GridSearchCV maximises the score)
    err = (estimator.predict(X) - y)**2
    return -err.sum()/len(err)
Best hyperparameters found so far: max depth 24, number of trees 84, min samples per leaf 17, min samples to split 51.
In [21]:
from sklearn import svm
In [18]:
svr = svm.SVR(C=100000)
In [ ]:
%%time
svr = svr.fit(X_train,y_train)
In [ ]:
err = (svr.predict(X_train)-y_train)**2
err.sum()/len(err)
In [ ]:
err = (svr.predict(X_test)-y_test)**2
err.sum()/len(err)
In [25]:
%%time
svr_score = cross_val_score(svr, XX, y, cv=5)
print("Score: %s\nMean: %.03f"%(svr_score,svr_score.mean()))
In [27]:
knn = KNeighborsRegressor(n_neighbors=6,weights='distance',algorithm='ball_tree')
In [28]:
#parameters = {'weights':('distance','uniform'),'algorithm':('auto', 'ball_tree', 'kd_tree', 'brute')}
parameters = {'n_neighbors':range(1,10,1)}
grid_knn = grid_search.GridSearchCV(knn, parameters,scoring=manualScorer)
In [29]:
%%time
grid_knn.fit(X_train,y_train)
Out[29]:
In [30]:
print(grid_knn.grid_scores_)
print("Best: ",grid_knn.best_params_)
In [31]:
knn = grid_knn.best_estimator_
In [32]:
knn= knn.fit(X_train,y_train)
In [33]:
err = (knn.predict(X_train)-y_train)**2
err.sum()/len(err)
Out[33]:
In [34]:
err = (knn.predict(X_test)-y_test)**2
err.sum()/len(err)
Out[34]:
In [35]:
etreg = ExtraTreesRegressor(n_estimators=200, max_depth=None, min_samples_split=1, random_state=0)
In [36]:
parameters = {'n_estimators':range(100,200,20)}
grid_rf = grid_search.GridSearchCV(etreg, parameters,n_jobs=2,scoring=manualScorer)
In [37]:
%%time
grid_rf.fit(X_train,y_train)
Out[37]:
In [38]:
print(grid_rf.grid_scores_)
print("Best: ",grid_rf.best_params_)
In [39]:
grid_rf.best_params_
Out[39]:
In [40]:
es = etreg
#es = grid_rf.best_estimator_
In [41]:
%%time
es = es.fit(X_train,y_train)
In [42]:
err = (es.predict(X_train)-y_train)**2
err.sum()/len(err)
Out[42]:
In [43]:
err = (es.predict(X_test)-y_test)**2
err.sum()/len(err)
Out[43]:
In [98]:
gbr = GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=900,
subsample=1.0, min_samples_split=2, min_samples_leaf=1,
min_weight_fraction_leaf=0.0, max_depth=4, init=None,
random_state=None, max_features=None, alpha=0.5,
verbose=0, max_leaf_nodes=None, warm_start=False)
In [99]:
%%time
gbr = gbr.fit(X_train,y_train)
os.system('say "終わりだ"') # "it's over!" (macOS text-to-speech notification)
In [80]:
# Each grid below was tried at some point; as written (all commented out),
# grid_gbr reuses the last `parameters` defined above (the n_estimators grid).
#parameters = {'max_depth':range(2,5,1),'alpha':[0.5,0.6,0.7,0.8,0.9]}
#parameters = {'subsample':[0.2,0.4,0.5,0.6,0.8,1]}
#parameters = {'subsample':[0.2,0.5,0.6,0.8,1],'n_estimators':[800,1000,1200]}
#parameters = {'max_depth':range(2,4,1)}
#parameters = {'n_estimators':[600,800,900]}
#parameters = {'loss':['ls', 'lad', 'huber', 'quantile'],'alpha':[0.3,0.5,0.8,0.9]}
grid_gbr = grid_search.GridSearchCV(gbr, parameters,n_jobs=2,scoring=manualScorer)
In [81]:
%%time
grid_gbr = grid_gbr.fit(X_train,y_train)
In [82]:
print(grid_gbr.grid_scores_)
print("Best: ",grid_gbr.best_params_)
In [85]:
err = (gbr.predict(X_train)-y_train)**2
print(err.sum()/len(err))
err = (gbr.predict(X_test)-y_test)**2
print(err.sum()/len(err))
In [100]:
err = (gbr.predict(X_train)-y_train)**2
print(err.sum()/len(err))
err = (gbr.predict(X_test)-y_test)**2
print(err.sum()/len(err))
In [ ]:
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, RMSprop
in_dim = len(XX[0])
out_dim = 1
model = Sequential()
# Dense(128) is a fully-connected layer with 128 hidden units.
# The first layer must specify the expected input data shape:
# here, vectors of length in_dim.
model.add(Dense(128, input_shape=(in_dim,)))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(1, init='uniform'))
model.add(Activation('linear'))
#sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
#model.compile(loss='mean_squared_error', optimizer=sgd)
rms = RMSprop()
model.compile(loss='mean_squared_error', optimizer=rms)
#model.fit(X_train, y_train, nb_epoch=20, batch_size=16)
#score = model.evaluate(X_test, y_test, batch_size=16)
In [ ]:
prep = []
for i in y_train:
prep.append(min(i,20))
In [ ]:
prep=np.array(prep)
mi,ma = prep.min(),prep.max()
fy = (prep-mi) / (ma-mi)
#my = fy.max()
#fy = fy/fy.max()
In [ ]:
model.fit(np.array(X_train), fy, batch_size=10, nb_epoch=10, validation_split=0.1)
In [ ]:
pred = model.predict(np.array(X_test)).flatten()*(ma-mi)+mi  # invert the min-max scaling
In [ ]:
err = (pred-y_test)**2
err.sum()/len(err)
In [ ]:
r = random.randrange(len(X_train))
print("(Train) Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_train[r]]))[0][0]*(ma-mi)+mi,y_train[r]))
r = random.randrange(len(X_test))
print("(Test) Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_test[r]]))[0][0]*(ma-mi)+mi,y_test[r]))
In [ ]:
def marshall_palmer(ref, minutes_past):
#print("Estimating rainfall from {0} observations".format(len(minutes_past)))
# how long is each observation valid?
valid_time = np.zeros_like(minutes_past)
valid_time[0] = minutes_past.iloc[0]
for n in xrange(1, len(minutes_past)):
valid_time[n] = minutes_past.iloc[n] - minutes_past.iloc[n-1]
valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
valid_time = valid_time / 60.0
# sum up rainrate * validtime
sum = 0
for dbz, hours in zip(ref, valid_time):
# See: https://en.wikipedia.org/wiki/DBZ_(meteorology)
if np.isfinite(dbz):
mmperhr = pow(pow(10, dbz/10)/200, 0.625)
sum = sum + mmperhr * hours
return sum
def simplesum(ref, hour):
    return hour.sum()
# each unique Id is an hour of data at some gauge
def myfunc(hour):
#rowid = hour['Id'].iloc[0]
# sort hour by minutes_past
hour = hour.sort('minutes_past', ascending=True)
est = marshall_palmer(hour['Ref'], hour['minutes_past'])
return est
In [ ]:
info = raw.groupby(raw.index)
In [ ]:
estimates = raw.groupby(raw.index).apply(myfunc)
estimates.head(20)
In [ ]:
%%time
etreg.fit(X_train,y_train)
In [ ]:
%%time
et_score = cross_val_score(etreg, XX, y, cv=5)
print("Score: %s\tMean: %.03f"%(et_score,et_score.mean()))
In [ ]:
err = (etreg.predict(X_test)-y_test)**2
err.sum()/len(err)
In [ ]:
r = random.randrange(len(X_train))
print(r)
print(etreg.predict(X_train[r]))
print(y_train[r])
r = random.randrange(len(X_test))
print(r)
print(etreg.predict(X_test[r]))
print(y_test[r])
In [102]:
%%time
filename = "data/reduced_test_5000.csv"
#filename = "data/test.csv"
test = pd.read_csv(filename)
test = test.set_index('Id')
In [113]:
features_columns = list([u'Ref', u'Ref_5x5_10th',
u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
def getX(raw):
selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
data = raw[selected_columns]
docX= []
for i in data.index.unique():
if isinstance(data.loc[i],pd.core.series.Series):
m = [data.loc[i].as_matrix()]
docX.append(m)
else:
m = data.loc[i].as_matrix()
docX.append(m)
X = np.array(docX)
return X
In [114]:
X=getX(test)
tmp = []
for i in X:
tmp.append(len(i))
tmp = np.array(tmp)
sns.countplot(tmp,order=range(tmp.min(),tmp.max()+1))
plt.title("Number of ID per number of observations\n(On test dataset)")
plt.plot()
In [105]:
testFull = test.dropna()
In [106]:
%%time
X=getX(testFull) # 1min
XX = [np.array(t).mean(0) for t in X] # 10s
In [107]:
%%time
XX = []
for t in X:
    arr = np.array(t)
    # Per-feature mean and summed first differences over the sequence
    tmp = np.append(np.nanmean(arr, 0), (arr[1:] - arr[:-1]).sum(0))
    # Per-feature 10th/50th/90th percentiles
    tmp = np.append(tmp, np.percentile(t, 10, axis=0))
    tmp = np.append(tmp, np.percentile(t, 50, axis=0))
    tmp = np.append(tmp, np.percentile(t, 90, axis=0))
    seq = t   # use a local name so the `test` DataFrame is not overwritten
    try:
        taa = seq[:, 0]            # minutes_past column
    except TypeError:
        taa = [seq[0][0]]          # single-observation Id
    # How long each observation is valid for, converted to hours
    valid_time = np.zeros_like(taa)
    valid_time[0] = taa[0]
    for n in xrange(1, len(taa)):
        valid_time[n] = taa[n] - taa[n-1]
    valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
    valid_time = valid_time / 60.0
    rain_sum = 0
    try:
        column_ref = seq[:, 2]     # Ref column
    except TypeError:
        column_ref = [seq[0][2]]
    for dbz, hours in zip(column_ref, valid_time):
        # Marshall-Palmer estimate, see: https://en.wikipedia.org/wiki/DBZ_(meteorology)
        if np.isfinite(dbz):
            mmperhr = pow(pow(10, dbz/10)/200, 0.625)
            rain_sum = rain_sum + mmperhr * hours
    XX.append(np.append(np.array(rain_sum), tmp))
In [108]:
pd.DataFrame(gbr.predict(XX)).describe()
In [ ]:
predFull = zip(testFull.index.unique(),etreg.predict(XX))
In [ ]:
testNan = test.drop(test[features_columns].dropna(how='all').index)
In [ ]:
tmp = np.empty(len(testNan))
tmp.fill(0.445000) # 50th percentile of full Nan dataset
predNan = zip(testNan.index.unique(),tmp)
In [ ]:
testLeft = test.drop(testNan.index.unique()).drop(testFull.index.unique())
In [ ]:
tmp = np.empty(len(testLeft))
tmp.fill(1.27) # 50th percentile of the partially-missing training subset
predLeft = zip(testLeft.index.unique(),tmp)
In [ ]:
len(testFull.index.unique())
In [ ]:
len(testNan.index.unique())
In [ ]:
len(testLeft.index.unique())
In [ ]:
pred = predFull + predNan + predLeft
In [ ]:
pred.sort(key=lambda x: x[0], reverse=False)
In [ ]:
submission = pd.DataFrame(pred)
submission.columns = ["Id","Expected"]
submission.head()
In [ ]:
submission.to_csv("first_submit.csv",index=False)
In [ ]:
filename = "data/sample_solution.csv"
sol = pd.read_csv(filename)
In [ ]:
ss = np.array(sol)
In [ ]:
%%time
for a,b in predFull:
ss[a-1][1]=b
In [ ]:
ss
In [ ]:
sub = pd.DataFrame(ss)
sub.columns = ["Id","Expected"]
sub.Id = sub.Id.astype(int)
sub.head()
In [ ]:
sub.to_csv("submit_2.csv",index=False)