FDMS TME3

Kaggle How Much Did It Rain? II

Florian Toque & Paul Willot

Notes

We tried different model, like SVM regression, MLP, Random Forest and KNN as recommanded by the winning team of the Kaggle on taxi trajectories. So far Random Forest seems to be the best, slightly better than the SVM.
The new features we exctracted only made a very small impact on predictions.


In [97]:
# from __future__ import exam_success
from __future__ import absolute_import
from __future__ import print_function

%matplotlib inline
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import random
import pandas as pd
import scipy.stats as stats

# Sk cheatsfrom sklearn.ensemble import ExtraTreesRegressor
from sklearn.cross_validation import cross_val_score  # cross val
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import Imputer   # get rid of nan
from sklearn.neighbors import KNeighborsRegressor
from sklearn import grid_search
import os
  • 13.765.202 lines in train.csv
  • 8.022.757 lines in test.csv

Few words about the dataset

Predictions is made in the USA corn growing states (mainly Iowa, Illinois, Indiana) during the season with the highest rainfall (as illustrated by Iowa for the april to august months)

The Kaggle page indicate that the dataset have been shuffled, so working on a subset seems acceptable
The test set is not a extracted from the same data as the training set however, which make the evaluation trickier

Load the dataset


In [111]:
%%time
#filename = "data/train.csv"
filename = "data/reduced_train_10000.csv"
#filename = "data/reduced_train_1000000.csv"
raw = pd.read_csv(filename)
raw = raw.set_index('Id')


CPU times: user 26.2 ms, sys: 6.82 ms, total: 33 ms
Wall time: 35.2 ms

In [112]:
raw.columns


Out[112]:
Index([u'minutes_past', u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th', u'Expected'],
      dtype='object')

In [110]:
testFull.columns


Out[110]:
Index([u'minutes_past', u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'],
      dtype='object')

In [3]:
raw['Expected'].describe()


Out[3]:
count    100000.000000
mean        129.579825
std         687.622542
min           0.010000
25%           0.254000
50%           1.016000
75%           3.556002
max       32740.617000
Name: Expected, dtype: float64

Per wikipedia, a value of more than 421 mm/h is considered "Extreme/large hail"
If we encounter the value 327.40 meter per hour, we should probably start building Noah's ark
Therefor, it seems reasonable to drop values too large, considered as outliers


In [87]:
# Considering that the gauge may concentrate the rainfall, we set the cap to 1000
# Comment this line to analyse the complete dataset 
l = len(raw)
raw = raw[raw['Expected'] < 300]  #1000
print("Dropped %d (%0.2f%%)"%(l-len(raw),(l-len(raw))/float(l)*100))


Dropped 51720 (5.17%)

In [5]:
raw.head(5)


Out[5]:
minutes_past radardist_km Ref Ref_5x5_10th Ref_5x5_50th Ref_5x5_90th RefComposite RefComposite_5x5_10th RefComposite_5x5_50th RefComposite_5x5_90th ... RhoHV_5x5_90th Zdr Zdr_5x5_10th Zdr_5x5_50th Zdr_5x5_90th Kdp Kdp_5x5_10th Kdp_5x5_50th Kdp_5x5_90th Expected
Id
1 3 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254
1 16 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254
1 25 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254
1 35 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254
1 45 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254

5 rows × 23 columns


In [6]:
raw.describe()


Out[6]:
minutes_past radardist_km Ref Ref_5x5_10th Ref_5x5_50th Ref_5x5_90th RefComposite RefComposite_5x5_10th RefComposite_5x5_50th RefComposite_5x5_90th ... RhoHV_5x5_90th Zdr Zdr_5x5_10th Zdr_5x5_50th Zdr_5x5_90th Kdp Kdp_5x5_10th Kdp_5x5_50th Kdp_5x5_90th Expected
count 93759.000000 93759.000000 44923.000000 38391.000000 45078.000000 53038.000000 47946.000000 42219.000000 48042.000000 55118.000000 ... 42064.000000 35938.000000 30835.000000 35925.000000 42064.000000 31231.000000 26418.000000 31283.000000 36505.000000 93759.000000
mean 29.684830 11.022334 23.684482 20.788948 23.378688 26.427731 25.424874 22.956797 25.139201 27.982365 ... 1.014742 0.597837 -0.564851 0.429577 2.018197 -0.013098 -3.383198 -0.429909 3.855601 5.837679
std 17.418876 4.259865 10.224306 9.073503 9.936862 11.186952 10.627954 9.638466 10.372406 11.535892 ... 0.045336 1.388384 0.974288 0.864887 1.539513 3.747791 2.771442 2.194894 3.762005 22.764656
min 0.000000 0.000000 -29.000000 -31.500000 -31.500000 -26.500000 -26.500000 -27.500000 -25.000000 -23.000000 ... 0.208333 -7.875000 -7.875000 -7.875000 -7.875000 -52.880005 -51.420000 -46.870010 -41.540010 0.010000
25% 15.000000 9.000000 17.500000 15.500000 17.500000 19.500000 19.000000 17.500000 18.500000 20.500000 ... 0.998333 -0.062500 -1.000000 0.062500 1.125000 -1.410004 -4.230011 -0.710007 1.759994 0.254000
50% 30.000000 12.000000 24.000000 21.000000 23.500000 27.000000 25.500000 23.000000 25.500000 28.500000 ... 1.005000 0.500000 -0.500000 0.375000 1.687500 0.000000 -2.809998 0.000000 3.169998 0.830000
75% 45.000000 14.000000 30.500000 27.000000 30.500000 34.500000 33.000000 29.500000 32.500000 36.500000 ... 1.051667 1.125000 0.000000 0.750000 2.500000 1.409988 -1.740006 0.349991 5.289993 2.794001
max 59.000000 21.000000 64.500000 57.000000 61.500000 67.500000 68.000000 59.500000 64.000000 79.500000 ... 1.051667 7.937500 5.937500 7.937500 7.937500 47.849990 1.759994 5.629990 43.209990 244.000120

8 rows × 23 columns

We regroup the data by ID


In [7]:
# We select all features except for the minutes past,
# because we ignore the time repartition of the sequence for now

features_columns = list([u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])

def getXy(raw):
    selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
    
    data = raw[selected_columns]
    
    docX, docY = [], []
    for i in data.index.unique():
        if isinstance(data.loc[i],pd.core.series.Series):
            m = [data.loc[i].as_matrix()]
            docX.append(m)
            docY.append(float(raw.loc[i]["Expected"]))
        else:
            m = data.loc[i].as_matrix()
            docX.append(m)
            docY.append(float(raw.loc[i][:1]["Expected"]))
    X , y = np.array(docX) , np.array(docY)
    return X,y

On fully filled dataset


In [88]:
#noAnyNan = raw.loc[raw[features_columns].dropna(how='any').index.unique()]
noAnyNan = raw.dropna()

In [89]:
noFullNan = raw.loc[raw[features_columns].dropna(how='all').index.unique()]

In [90]:
fullNan = raw.drop(raw[features_columns].dropna(how='all').index)

Predicitons

As a first try, we make predictions on the complete data, and return the 50th percentile and uncomplete and fully empty data


In [91]:
%%time
X,y=getXy(noAnyNan)


CPU times: user 17.5 s, sys: 127 ms, total: 17.6 s
Wall time: 17.7 s

In [12]:
%%time
#XX = [np.array(t).mean(0) for t in X]
XX = [np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) ) for t in X]


CPU times: user 175 ms, sys: 3.33 ms, total: 178 ms
Wall time: 179 ms

In [92]:
%%time
XX=[]
for t in X:
    #print(idx)
    
    tmp = np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) )
    tmp = np.append(tmp,np.percentile(t,10,axis=0))
    tmp = np.append(tmp,np.percentile(t,50,axis=0))
    tmp = np.append(tmp,np.percentile(t,90,axis=0))

    test = t
    
    try:
        taa=test[:,0]
    except TypeError:
        taa=[test[0][0]]
    valid_time = np.zeros_like(taa)
    valid_time[0] = taa[0]
    for n in xrange(1,len(taa)):
        valid_time[n] = taa[n] - taa[n-1]
    valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
    valid_time = valid_time / 60.0

    
    sum=0
    try:
        column_ref=test[:,2]
    except TypeError:
        column_ref=[test[0][2]]
    for dbz, hours in zip(column_ref, valid_time):
        # See: https://en.wikipedia.org/wiki/DBZ_(meteorology)
        if np.isfinite(dbz):
            mmperhr = pow(pow(10, dbz/10)/200, 0.625)
            sum = sum + mmperhr * hours
            
    XX.append(np.append(np.array(sum),tmp))
    #XX.append(np.array([sum]))
    #XX.append(tmp)


CPU times: user 8.67 s, sys: 36.9 ms, total: 8.7 s
Wall time: 8.73 s

In [14]:
XX[2]


Out[14]:
array([  5.99821681e-01,   2.94285714e+01,   1.20000000e+01,
         1.92857143e+01,   1.73571429e+01,   1.99285714e+01,
         2.30714286e+01,   1.92857143e+01,   1.73571429e+01,
         1.99285714e+01,   2.30714286e+01,   9.85476206e-01,
         9.68333350e-01,   9.91190490e-01,   1.01166668e+00,
         4.82142857e-01,  -4.28571429e-01,   1.96428571e-01,
         9.64285714e-01,  -1.45000351e+00,  -2.45857240e+00,
        -1.50002620e-01,   2.46570911e+00,   5.70000000e+01,
         0.00000000e+00,  -6.00000000e+00,  -1.50000000e+00,
        -1.50000000e+00,  -5.00000000e-01,  -6.00000000e+00,
        -1.50000000e+00,  -1.50000000e+00,  -5.00000000e-01,
         3.00000000e-02,   2.00000600e-02,   1.33333000e-02,
        -6.66670000e-03,   1.50000000e+00,   0.00000000e+00,
        -6.25000000e-02,   1.87500000e-01,   6.99997000e-01,
        -1.40998840e+00,  -1.41000370e+00,  -2.11999510e+00,
         6.40000000e+00,   1.20000000e+01,   1.59000000e+01,
         1.59000000e+01,   1.85000000e+01,   2.18000000e+01,
         1.59000000e+01,   1.59000000e+01,   1.85000000e+01,
         2.18000000e+01,   9.65666672e-01,   9.58333338e-01,
         9.81666700e-01,   1.00033335e+00,  -1.75000000e-01,
        -6.50000000e-01,   1.00000000e-01,   6.75000000e-01,
        -3.16999820e+00,  -3.16999820e+00,  -1.05999760e+00,
         1.75999450e+00,   2.90000000e+01,   1.20000000e+01,
         2.00000000e+01,   1.65000000e+01,   1.95000000e+01,
         2.30000000e+01,   2.00000000e+01,   1.65000000e+01,
         1.95000000e+01,   2.30000000e+01,   9.88333340e-01,
         9.68333360e-01,   9.95000000e-01,   1.00833330e+00,
         5.62500000e-01,  -4.37500000e-01,   1.87500000e-01,
         9.37500000e-01,  -1.34001160e+00,  -2.47000120e+00,
        -3.50006100e-01,   2.11999500e+00,   5.26000000e+01,
         1.20000000e+01,   2.20000000e+01,   1.96000000e+01,
         2.18000000e+01,   2.44000000e+01,   2.20000000e+01,
         1.96000000e+01,   2.18000000e+01,   2.44000000e+01,
         1.00166668e+00,   9.77666680e-01,   9.96333332e-01,
         1.02700000e+00,   1.21250000e+00,  -2.37500000e-01,
         2.75000000e-01,   1.27500000e+00,  -1.42007436e-01,
        -1.60000612e+00,   8.49993916e-01,   3.65799254e+00])
(X[0][1:] - X[0][:-1]).sum(0) (t[1:] - t[:-1]).sum(0) %%time etreg = etreg.fit(XX,y)

In [93]:
def splitTrainTest(X, y, split=0.2):
    tmp1, tmp2 = [], []
    ps = int(len(X) * (1-split))
    index_shuf = range(len(X))
    random.shuffle(index_shuf)
    for i in index_shuf:
        tmp1.append(X[i])
        tmp2.append(y[i])
    return tmp1[:ps], tmp2[:ps], tmp1[ps:], tmp2[ps:]

In [94]:
X_train,y_train, X_test, y_test = splitTrainTest(XX,y)


In [22]:
def manualScorer(estimator, X, y):
    err = (estimator.predict(X_test)-y_test)**2
    return -err.sum()/len(err)

max prof 24 nb trees 84 min sample per leaf 17 min sample to split 51


In [21]:
from sklearn import svm

In [18]:
svr = svm.SVR(C=100000)

In [ ]:
%%time
srv = svr.fit(X_train,y_train)

In [ ]:
err = (svr.predict(X_train)-y_train)**2
err.sum()/len(err)

In [ ]:
err = (svr.predict(X_test)-y_test)**2
err.sum()/len(err)

In [25]:
%%time
svr_score = cross_val_score(svr, XX, y, cv=5)
print("Score: %s\nMean: %.03f"%(svr_score,svr_score.mean()))


Score: [ 0.52865286  0.91093893  0.8215682   0.87335142  0.21272327]
Mean: 0.669
CPU times: user 6.05 s, sys: 53.1 ms, total: 6.1 s
Wall time: 6.16 s


In [27]:
knn = KNeighborsRegressor(n_neighbors=6,weights='distance',algorithm='ball_tree')

In [28]:
#parameters = {'weights':('distance','uniform'),'algorithm':('auto', 'ball_tree', 'kd_tree', 'brute')}
parameters = {'n_neighbors':range(1,10,1)}
grid_knn = grid_search.GridSearchCV(knn, parameters,scoring=manualScorer)

In [29]:
%%time
grid_knn.fit(X_train,y_train)


CPU times: user 1.26 s, sys: 4.03 ms, total: 1.27 s
Wall time: 1.27 s
Out[29]:
GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsRegressor(algorithm='ball_tree', leaf_size=30, metric='minkowski',
          metric_params=None, n_neighbors=6, p=2, weights='distance'),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring=<function manualScorer at 0x110aa0ed8>, verbose=0)

In [30]:
print(grid_knn.grid_scores_)
print("Best: ",grid_knn.best_params_)


[mean: -104.53917, std: 46.90261, params: {'n_neighbors': 1}, mean: -91.09434, std: 35.17553, params: {'n_neighbors': 2}, mean: -101.84580, std: 49.26225, params: {'n_neighbors': 3}, mean: -105.42949, std: 71.07642, params: {'n_neighbors': 4}, mean: -103.53711, std: 77.01319, params: {'n_neighbors': 5}, mean: -95.13442, std: 69.23005, params: {'n_neighbors': 6}, mean: -91.41373, std: 65.75289, params: {'n_neighbors': 7}, mean: -86.27320, std: 61.05561, params: {'n_neighbors': 8}, mean: -82.19850, std: 57.05147, params: {'n_neighbors': 9}]
Best:  {'n_neighbors': 9}

In [31]:
knn = grid_knn.best_estimator_

In [32]:
knn= knn.fit(X_train,y_train)

In [33]:
err = (knn.predict(X_train)-y_train)**2
err.sum()/len(err)


Out[33]:
2.1117743915936922

In [34]:
err = (knn.predict(X_test)-y_test)**2
err.sum()/len(err)


Out[34]:
42.269848400778471


In [35]:
etreg = ExtraTreesRegressor(n_estimators=200, max_depth=None, min_samples_split=1, random_state=0)

In [36]:
parameters = {'n_estimators':range(100,200,20)}
grid_rf = grid_search.GridSearchCV(etreg, parameters,n_jobs=2,scoring=manualScorer)

In [37]:
%%time
grid_rf.fit(X_train,y_train)


CPU times: user 4.44 s, sys: 241 ms, total: 4.68 s
Wall time: 16.3 s
Out[37]:
GridSearchCV(cv=None, error_score='raise',
       estimator=ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
          min_samples_split=1, min_weight_fraction_leaf=0.0,
          n_estimators=200, n_jobs=1, oob_score=False, random_state=0,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=2,
       param_grid={'n_estimators': [100, 120, 140, 160, 180]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring=<function manualScorer at 0x110aa0ed8>, verbose=0)

In [38]:
print(grid_rf.grid_scores_)
print("Best: ",grid_rf.best_params_)


[mean: -55.73522, std: 40.35044, params: {'n_estimators': 100}, mean: -55.47051, std: 39.85010, params: {'n_estimators': 120}, mean: -56.18434, std: 40.62698, params: {'n_estimators': 140}, mean: -56.15046, std: 40.74838, params: {'n_estimators': 160}, mean: -56.37052, std: 40.72395, params: {'n_estimators': 180}]
Best:  {'n_estimators': 120}

In [39]:
grid_rf.best_params_


Out[39]:
{'n_estimators': 120}

In [40]:
es = etreg
#es = grid_rf.best_estimator_

In [41]:
%%time
es = es.fit(X_train,y_train)


CPU times: user 3.4 s, sys: 16.4 ms, total: 3.41 s
Wall time: 3.42 s

In [42]:
err = (es.predict(X_train)-y_train)**2
err.sum()/len(err)


Out[42]:
2.1117743915936917

In [43]:
err = (es.predict(X_test)-y_test)**2
err.sum()/len(err)


Out[43]:
23.042852064874744


In [98]:
gbr = GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=900,
                                subsample=1.0, min_samples_split=2, min_samples_leaf=1,
                                min_weight_fraction_leaf=0.0, max_depth=4, init=None,
                                random_state=None, max_features=None, alpha=0.5,
                                verbose=0, max_leaf_nodes=None, warm_start=False)

In [99]:
%%time
gbr = gbr.fit(X_train,y_train)
os.system('say "終わりだ"') #its over!


CPU times: user 4min 48s, sys: 653 ms, total: 4min 49s
Wall time: 4min 51s

In [80]:
#parameters = {'max_depth':range(2,5,1),'alpha':[0.5,0.6,0.7,0.8,0.9]}
#parameters = {'subsample':[0.2,0.4,0.5,0.6,0.8,1]}
#parameters = {'subsample':[0.2,0.5,0.6,0.8,1],'n_estimators':[800,1000,1200]}
#parameters = {'max_depth':range(2,4,1)}
#parameters = {'n_estimators':[600,800,900]}
#parameters = {'loss':['ls', 'lad', 'huber', 'quantile'],'alpha':[0.3,0.5,0.8,0.9]}

grid_gbr = grid_search.GridSearchCV(gbr, parameters,n_jobs=2,scoring=manualScorer)

In [81]:
%%time
grid_gbr = grid_gbr.fit(X_train,y_train)


CPU times: user 16.4 s, sys: 1.16 s, total: 17.5 s
Wall time: 2min 27s

In [82]:
print(grid_gbr.grid_scores_)
print("Best: ",grid_gbr.best_params_)


[mean: -42.96759, std: 2.26566, params: {'alpha': 0.3, 'loss': 'ls'}, mean: -112.57348, std: 2.89117, params: {'alpha': 0.3, 'loss': 'lad'}, mean: -100.31314, std: 6.08331, params: {'alpha': 0.3, 'loss': 'huber'}, mean: -130.80967, std: 0.93008, params: {'alpha': 0.3, 'loss': 'quantile'}, mean: -42.74488, std: 7.80268, params: {'alpha': 0.5, 'loss': 'ls'}, mean: -114.97495, std: 1.49172, params: {'alpha': 0.5, 'loss': 'lad'}, mean: -80.46645, std: 6.05650, params: {'alpha': 0.5, 'loss': 'huber'}, mean: -116.54107, std: 4.52786, params: {'alpha': 0.5, 'loss': 'quantile'}, mean: -48.71631, std: 9.54319, params: {'alpha': 0.8, 'loss': 'ls'}, mean: -114.75080, std: 2.35584, params: {'alpha': 0.8, 'loss': 'lad'}, mean: -51.93217, std: 9.86208, params: {'alpha': 0.8, 'loss': 'huber'}, mean: -85.26825, std: 3.29395, params: {'alpha': 0.8, 'loss': 'quantile'}, mean: -45.76086, std: 2.27370, params: {'alpha': 0.9, 'loss': 'ls'}, mean: -117.01689, std: 8.90194, params: {'alpha': 0.9, 'loss': 'lad'}, mean: -44.38461, std: 7.82885, params: {'alpha': 0.9, 'loss': 'huber'}, mean: -102.06466, std: 7.07691, params: {'alpha': 0.9, 'loss': 'quantile'}]
Best:  {'alpha': 0.5, 'loss': 'ls'}

In [85]:
err = (gbr.predict(X_train)-y_train)**2
print(err.sum()/len(err))
err = (gbr.predict(X_test)-y_test)**2
print(err.sum()/len(err))


1.82991003151
23.8878088993

In [100]:
err = (gbr.predict(X_train)-y_train)**2
print(err.sum()/len(err))
err = (gbr.predict(X_test)-y_test)**2
print(err.sum()/len(err))


27.325739932
123.44709784


In [ ]:
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD

in_dim = len(XX[0])
out_dim = 1  

model = Sequential()
# Dense(64) is a fully-connected layer with 64 hidden units.
# in the first layer, you must specify the expected input data shape:
# here, 20-dimensional vectors.
model.add(Dense(128, input_shape=(in_dim,)))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(1, init='uniform'))
model.add(Activation('linear'))

#sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
#model.compile(loss='mean_squared_error', optimizer=sgd)

rms = RMSprop()
model.compile(loss='mean_squared_error', optimizer=rms)

#model.fit(X_train, y_train, nb_epoch=20, batch_size=16)
#score = model.evaluate(X_test, y_test, batch_size=16)

In [ ]:
prep = []
for i in y_train:
    prep.append(min(i,20))

In [ ]:
prep=np.array(prep)
mi,ma = prep.min(),prep.max()
fy = (prep-mi) / (ma-mi)
#my = fy.max()
#fy = fy/fy.max()

In [ ]:
model.fit(np.array(X_train), fy, batch_size=10, nb_epoch=10, validation_split=0.1)

In [ ]:
pred = model.predict(np.array(X_test))*ma+mi

In [ ]:
err = (pred-y_test)**2
err.sum()/len(err)

In [ ]:
r = random.randrange(len(X_train))
print("(Train) Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_train[r]]))[0][0]*ma+mi,y_train[r]))

r = random.randrange(len(X_test))
print("(Test)  Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_test[r]]))[0][0]*ma+mi,y_test[r]))


In [ ]:
def marshall_palmer(ref, minutes_past):
    #print("Estimating rainfall from {0} observations".format(len(minutes_past)))
    # how long is each observation valid?
    valid_time = np.zeros_like(minutes_past)
    valid_time[0] = minutes_past.iloc[0]
    for n in xrange(1, len(minutes_past)):
        valid_time[n] = minutes_past.iloc[n] - minutes_past.iloc[n-1]
    valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
    valid_time = valid_time / 60.0

    # sum up rainrate * validtime
    sum = 0
    for dbz, hours in zip(ref, valid_time):
        # See: https://en.wikipedia.org/wiki/DBZ_(meteorology)
        if np.isfinite(dbz):
            mmperhr = pow(pow(10, dbz/10)/200, 0.625)
            sum = sum + mmperhr * hours
    return sum


def simplesum(ref,hour):
    hour.sum()

# each unique Id is an hour of data at some gauge
def myfunc(hour):
    #rowid = hour['Id'].iloc[0]
    # sort hour by minutes_past
    hour = hour.sort('minutes_past', ascending=True)
    est = marshall_palmer(hour['Ref'], hour['minutes_past'])
    return est

In [ ]:
info = raw.groupby(raw.index)

In [ ]:
estimates = raw.groupby(raw.index).apply(myfunc)
estimates.head(20)

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:
%%time
etreg.fit(X_train,y_train)

In [ ]:
%%time
et_score = cross_val_score(etreg, XX, y, cv=5)
print("Score: %s\tMean: %.03f"%(et_score,et_score.mean()))

In [ ]:
%%time
et_score = cross_val_score(etreg, XX, y, cv=5)
print("Score: %s\tMean: %.03f"%(et_score,et_score.mean()))

In [ ]:
err = (etreg.predict(X_test)-y_test)**2
err.sum()/len(err)

In [ ]:
err = (etreg.predict(X_test)-y_test)**2
err.sum()/len(err)

In [ ]:
r = random.randrange(len(X_train))
print(r)
print(etreg.predict(X_train[r]))
print(y_train[r])

r = random.randrange(len(X_test))
print(r)
print(etreg.predict(X_test[r]))
print(y_test[r])


In [102]:
%%time
filename = "data/reduced_test_5000.csv"
#filename = "data/test.csv"
test = pd.read_csv(filename)
test = test.set_index('Id')


CPU times: user 10.6 ms, sys: 3.85 ms, total: 14.5 ms
Wall time: 14.9 ms

In [113]:
features_columns = list([u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])

def getX(raw):
    selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
    
    data = raw[selected_columns]
    
    docX= []
    for i in data.index.unique():
        if isinstance(data.loc[i],pd.core.series.Series):
            m = [data.loc[i].as_matrix()]
            docX.append(m)
        else:
            m = data.loc[i].as_matrix()
            docX.append(m)
    X = np.array(docX)
    return X

In [ ]:
# We select all features except for the minutes past,
# because we ignore the time repartition of the sequence for now

features_columns = list([u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])

def getXy(raw):
    selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
    
    data = raw[selected_columns]
    
    docX, docY = [], []
    for i in data.index.unique():
        if isinstance(data.loc[i],pd.core.series.Series):
            m = [data.loc[i].as_matrix()]
            docX.append(m)
            docY.append(float(raw.loc[i]["Expected"]))
        else:
            m = data.loc[i].as_matrix()
            docX.append(m)
            docY.append(float(raw.loc[i][:1]["Expected"]))
    X , y = np.array(docX) , np.array(docY)
    return X,y

In [114]:
X=getX(test)

tmp = []
for i in X:
    tmp.append(len(i))
tmp = np.array(tmp)
sns.countplot(tmp,order=range(tmp.min(),tmp.max()+1))
plt.title("Number of ID per number of observations\n(On test dataset)")
plt.plot()


/Library/Python/2.7/site-packages/ipykernel/__main__.py:18: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-114-bc80d4902c17> in <module>()
----> 1 X=getX(test)
      2 
      3 tmp = []
      4 for i in X:
      5     tmp.append(len(i))

<ipython-input-113-655fb5005179> in getX(raw)
     16        u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
     17 
---> 18     data = raw[selected_columns]
     19 
     20     docX= []

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

In [105]:
testFull = test.dropna()

In [106]:
%%time
X=getX(testFull)  # 1min
XX = [np.array(t).mean(0) for t in X]  # 10s


CPU times: user 29.6 ms, sys: 2.63 ms, total: 32.3 ms
Wall time: 30 ms

In [107]:
%%time
XX=[]
for t in X:
    #print(idx)
    
    tmp = np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) )
    tmp = np.append(tmp,np.percentile(t,10,axis=0))
    tmp = np.append(tmp,np.percentile(t,50,axis=0))
    tmp = np.append(tmp,np.percentile(t,90,axis=0))

    test = t
    
    try:
        taa=test[:,0]
    except TypeError:
        taa=[test[0][0]]
    valid_time = np.zeros_like(taa)
    valid_time[0] = taa[0]
    for n in xrange(1,len(taa)):
        valid_time[n] = taa[n] - taa[n-1]
    valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
    valid_time = valid_time / 60.0

    
    sum=0
    try:
        column_ref=test[:,2]
    except TypeError:
        column_ref=[test[0][2]]
    for dbz, hours in zip(column_ref, valid_time):
        # See: https://en.wikipedia.org/wiki/DBZ_(meteorology)
        if np.isfinite(dbz):
            mmperhr = pow(pow(10, dbz/10)/200, 0.625)
            sum = sum + mmperhr * hours
            
    XX.append(np.append(np.array(sum),tmp))
    #XX.append(np.array([sum]))
    #XX.append(tmp)


CPU times: user 30.5 ms, sys: 1.65 ms, total: 32.2 ms
Wall time: 30.9 ms

In [108]:
pd.DataFrame(gbr.predict(XX)).describe()


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-108-6727ce741786> in <module>()
----> 1 pd.DataFrame(gbr.predict(XX)).describe()

/Library/Python/2.7/site-packages/sklearn/ensemble/gradient_boosting.pyc in predict(self, X)
   1597             The predicted values.
   1598         """
-> 1599         return self.decision_function(X).ravel()
   1600 
   1601     def staged_predict(self, X):

/Library/Python/2.7/site-packages/sklearn/ensemble/gradient_boosting.pyc in decision_function(self, X)
   1100         """
   1101         X = check_array(X, dtype=DTYPE, order="C")
-> 1102         score = self._decision_function(X)
   1103         if score.shape[1] == 1:
   1104             return score.ravel()

/Library/Python/2.7/site-packages/sklearn/ensemble/gradient_boosting.pyc in _decision_function(self, X)
   1079         # for use in inner loop, not raveling the output in single-class case,
   1080         # not doing input validation.
-> 1081         score = self._init_decision_function(X)
   1082         predict_stages(self.estimators_, X, self.learning_rate, score)
   1083         return score

/Library/Python/2.7/site-packages/sklearn/ensemble/gradient_boosting.pyc in _init_decision_function(self, X)
   1072         if X.shape[1] != self.n_features:
   1073             raise ValueError("X.shape[1] should be {0:d}, not {1:d}.".format(
-> 1074                 self.n_features, X.shape[1]))
   1075         score = self.init_.predict(X).astype(np.float64)
   1076         return score

ValueError: X.shape[1] should be 111, not 106.

In [ ]:
predFull = zip(testFull.index.unique(),etreg.predict(XX))

In [ ]:
testNan = test.drop(test[features_columns].dropna(how='all').index)

In [ ]:
tmp = np.empty(len(testNan))
tmp.fill(0.445000)   # 50th percentile of full Nan dataset
predNan = zip(testNan.index.unique(),tmp)

In [ ]:
testLeft = test.drop(testNan.index.unique()).drop(testFull.index.unique())

In [ ]:
tmp = np.empty(len(testLeft))
tmp.fill(1.27)   # 50th percentile of full Nan dataset
predLeft = zip(testLeft.index.unique(),tmp)

In [ ]:
len(testFull.index.unique())

In [ ]:
len(testNan.index.unique())

In [ ]:
len(testLeft.index.unique())

In [ ]:
pred = predFull + predNan + predLeft

In [ ]:
pred.sort(key=lambda x: x[0], reverse=False)

In [ ]:
submission = pd.DataFrame(pred)
submission.columns = ["Id","Expected"]
submission.head()

In [ ]:
submission.to_csv("first_submit.csv",index=False)

In [ ]:


In [ ]:
filename = "data/sample_solution.csv"
sol = pd.read_csv(filename)

In [ ]:
ss = np.array(sol)

In [ ]:
%%time
for a,b in predFull:
    ss[a-1][1]=b

In [ ]:
ss

In [ ]:
sub = pd.DataFrame(ss)
sub.columns = ["Id","Expected"]
sub.Id = sub.Id.astype(int)
sub.head()

In [ ]:
sub.to_csv("submit_2.csv",index=False)

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: