FDMS TME3

Kaggle How Much Did It Rain? II

Florian Toque & Paul Willot

Notes

We tried different model, like SVM regression, MLP, Random Forest and KNN as recommanded by the winning team of the Kaggle on taxi trajectories. So far Random Forest seems to be the best, slightly better than the SVM.
The new features we exctracted only made a very small impact on predictions.


In [2]:
# from __future__ import exam_success
from __future__ import absolute_import
from __future__ import print_function

%matplotlib inline
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import random
import pandas as pd
import scipy.stats as stats

# Sk cheats
from sklearn.cross_validation import cross_val_score  # cross val
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import Imputer   # get rid of nan
  • 13.765.202 lines in train.csv
  • 8.022.757 lines in test.csv

Few words about the dataset

Predictions is made in the USA corn growing states (mainly Iowa, Illinois, Indiana) during the season with the highest rainfall (as illustrated by Iowa for the april to august months)

The Kaggle page indicate that the dataset have been shuffled, so working on a subset seems acceptable
The test set is not a extracted from the same data as the training set however, which make the evaluation trickier

Load the dataset


In [3]:
%%time
filename = "data/reduced_train_100000.csv"
#filename = "data/reduced_train_100000.csv"
raw = pd.read_csv(filename)
raw = raw.set_index('Id')


CPU times: user 233 ms, sys: 40.5 ms, total: 274 ms
Wall time: 284 ms

In [4]:
raw['Expected'].describe()


Out[4]:
count    100000.000000
mean        129.579825
std         687.622542
min           0.010000
25%           0.254000
50%           1.016000
75%           3.556002
max       32740.617000
Name: Expected, dtype: float64

Per wikipedia, a value of more than 421 mm/h is considered "Extreme/large hail"
If we encounter the value 327.40 meter per hour, we should probably start building Noah's ark
Therefor, it seems reasonable to drop values too large, considered as outliers


In [5]:
# Considering that the gauge may concentrate the rainfall, we set the cap to 1000
# Comment this line to analyse the complete dataset 
l = len(raw)
raw = raw[raw['Expected'] < 1000]
print("Dropped %d (%0.2f%%)"%(l-len(raw),(l-len(raw))/float(l)*100))


Dropped 4175 (4.17%)

In [6]:
raw.head(5)


Out[6]:
minutes_past radardist_km Ref Ref_5x5_10th Ref_5x5_50th Ref_5x5_90th RefComposite RefComposite_5x5_10th RefComposite_5x5_50th RefComposite_5x5_90th ... RhoHV_5x5_90th Zdr Zdr_5x5_10th Zdr_5x5_50th Zdr_5x5_90th Kdp Kdp_5x5_10th Kdp_5x5_50th Kdp_5x5_90th Expected
Id
1 3 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254
1 16 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254
1 25 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254
1 35 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254
1 45 10 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254

5 rows × 23 columns


In [7]:
raw.describe()


Out[7]:
minutes_past radardist_km Ref Ref_5x5_10th Ref_5x5_50th Ref_5x5_90th RefComposite RefComposite_5x5_10th RefComposite_5x5_50th RefComposite_5x5_90th ... RhoHV_5x5_90th Zdr Zdr_5x5_10th Zdr_5x5_50th Zdr_5x5_90th Kdp Kdp_5x5_10th Kdp_5x5_50th Kdp_5x5_90th Expected
count 95825.000000 95825.000000 44990.000000 38441.000000 45146.000000 53163.000000 48020.000000 42275.000000 48117.000000 55260.000000 ... 42210.000000 36040.000000 30901.000000 36025.000000 42210.000000 31318.000000 26464.000000 31368.000000 36627.000000 95825.000000
mean 29.686595 11.052523 23.684530 20.789496 23.378716 26.419606 25.422845 22.955328 25.136573 27.966965 ... 1.014748 0.597368 -0.565218 0.428472 2.017049 -0.011121 -3.384371 -0.431668 3.854969 17.433421
std 17.417786 4.239598 10.227057 9.074095 9.939687 11.193085 10.630984 9.639113 10.377139 11.544993 ... 0.045614 1.387422 0.973801 0.865122 1.539430 3.747653 2.772211 2.202447 3.761289 84.809357
min 0.000000 0.000000 -29.000000 -31.500000 -31.500000 -26.500000 -26.500000 -27.500000 -25.000000 -23.000000 ... 0.208333 -7.875000 -7.875000 -7.875000 -7.875000 -52.880005 -51.420000 -46.870010 -41.540010 0.010000
25% 15.000000 9.000000 17.500000 15.500000 17.500000 19.500000 19.000000 17.500000 18.500000 20.500000 ... 0.998333 -0.062500 -1.000000 0.062500 1.125000 -1.410004 -4.230011 -0.710007 1.759994 0.254000
50% 30.000000 12.000000 24.000000 21.000000 23.500000 27.000000 25.500000 23.000000 25.500000 28.500000 ... 1.005000 0.500000 -0.500000 0.375000 1.687500 0.000000 -2.809998 0.000000 3.169998 1.016000
75% 45.000000 14.000000 30.500000 27.000000 30.500000 34.500000 33.000000 29.500000 32.500000 36.500000 ... 1.051667 1.125000 0.000000 0.750000 2.500000 1.409988 -1.740006 0.349991 5.289993 3.048001
max 59.000000 21.000000 64.500000 57.000000 61.500000 67.500000 68.000000 59.500000 64.000000 79.500000 ... 1.051667 7.937500 5.937500 7.937500 7.937500 47.849990 1.759994 5.629990 43.209990 876.300500

8 rows × 23 columns

We regroup the data by ID


In [8]:
# We select all features except for the minutes past,
# because we ignore the time repartition of the sequence for now

features_columns = list([u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])

def getXy(raw):
    selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
    
    data = raw[selected_columns]
    
    docX, docY = [], []
    for i in data.index.unique():
        if isinstance(data.loc[i],pd.core.series.Series):
            m = [data.loc[i].as_matrix()]
            docX.append(m)
            docY.append(float(raw.loc[i]["Expected"]))
        else:
            m = data.loc[i].as_matrix()
            docX.append(m)
            docY.append(float(raw.loc[i][:1]["Expected"]))
    X , y = np.array(docX) , np.array(docY)
    return X,y

On fully filled dataset


In [9]:
#noAnyNan = raw.loc[raw[features_columns].dropna(how='any').index.unique()]
noAnyNan = raw.dropna()

In [10]:
noFullNan = raw.loc[raw[features_columns].dropna(how='all').index.unique()]

In [11]:
fullNan = raw.drop(raw[features_columns].dropna(how='all').index)

Predicitons

As a first try, we make predictions on the complete data, and return the 50th percentile and uncomplete and fully empty data


In [12]:
%%time
X,y=getXy(noAnyNan)


CPU times: user 1.89 s, sys: 17 ms, total: 1.91 s
Wall time: 1.92 s

In [13]:
%%time
#XX = [np.array(t).mean(0) for t in X]
XX = [np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) ) for t in X]


CPU times: user 177 ms, sys: 5.26 ms, total: 182 ms
Wall time: 181 ms

In [17]:
t = X[0]

In [30]:
t


Out[30]:
array([[  1.        ,   2.        ,   9.        ,   5.        ,
          7.5       ,  10.5       ,  15.        ,  10.5       ,
         16.5       ,  23.5       ,   0.99833333,   0.99833333,
          0.99833333,   0.99833333,   0.375     ,  -0.125     ,
          0.3125    ,   0.875     ,   1.0599976 ,  -1.4100037 ,
         -0.3500061 ,   1.0599976 ],
       [ 16.        ,   2.        ,  18.        ,  14.        ,
         17.5       ,  21.        ,  20.5       ,  18.        ,
         20.5       ,  23.        ,   0.995     ,   0.995     ,
          0.99833333,   1.0016667 ,   0.25      ,   0.125     ,
          0.375     ,   0.6875    ,   0.34999084,  -1.0599976 ,
          0.        ,   1.0599976 ],
       [ 21.        ,   2.        ,  24.5       ,  16.5       ,
         21.        ,  24.5       ,  24.5       ,  21.        ,
         24.        ,  28.        ,   0.99833333,   0.995     ,
          0.99833333,   0.99833333,   0.25      ,   0.0625    ,
          0.1875    ,   0.5625    ,  -0.3500061 ,  -1.0599976 ,
         -0.3500061 ,   1.7599945 ],
       [ 26.        ,   2.        ,  12.        ,  12.        ,
         16.        ,  20.        ,  16.5       ,  17.        ,
         19.        ,  21.        ,   0.99833333,   0.995     ,
          0.99833333,   0.99833333,   0.5625    ,   0.25      ,
          0.4375    ,   0.6875    ,  -1.7600098 ,  -1.7600098 ,
         -0.3500061 ,   0.70999146],
       [ 31.        ,   2.        ,  22.5       ,  19.        ,
         22.        ,  25.        ,  26.        ,  23.5       ,
         25.5       ,  27.5       ,   0.99833333,   0.995     ,
          0.99833333,   1.0016667 ,   0.        ,  -0.1875    ,
          0.25      ,   0.625     ,  -1.0599976 ,  -2.1200104 ,
         -0.7100067 ,   0.34999084],
       [ 37.        ,   2.        ,  14.        ,  14.        ,
         18.5       ,  21.        ,  19.5       ,  20.        ,
         21.        ,  23.        ,   0.99833333,   0.9916667 ,
          0.99833333,   0.99833333,   0.5       ,   0.1875    ,
          0.4375    ,   0.8125    ,   0.        ,  -1.7600098 ,
         -0.3500061 ,   1.0599976 ],
       [ 42.        ,   2.        ,  12.        ,  11.        ,
         12.5       ,  17.        ,  19.5       ,  18.        ,
         21.        ,  23.        ,   0.99833333,   0.995     ,
          0.99833333,   0.99833333,   0.625     ,   0.375     ,
          0.625     ,   0.875     ,  -0.3500061 ,  -0.3500061 ,
          0.        ,   0.34999084],
       [ 47.        ,   2.        ,   1.5       ,   3.5       ,
          7.        ,  10.5       ,  18.        ,  16.5       ,
         18.5       ,  21.5       ,   0.99833333,   0.995     ,
          0.99833333,   0.99833333,   0.375     ,   0.1875    ,
          0.5       ,   0.6875    ,   0.34999084,  -2.1100006 ,
         -0.3500061 ,   1.0599976 ],
       [ 53.        ,   2.        ,  16.        ,  14.5       ,
         18.        ,  23.5       ,  28.        ,  23.5       ,
         26.5       ,  29.5       ,   0.99833333,   0.9916667 ,
          0.99833333,   0.99833333,   0.875     ,   0.625     ,
          0.9375    ,   1.375     ,  -0.3500061 ,  -1.4100037 ,
         -0.3500061 ,   2.119995  ]])

In [31]:
np.percentile(t,50,axis=0)


Out[31]:
array([ 31.        ,   2.        ,  14.        ,  14.        ,
        17.5       ,  21.        ,  19.5       ,  18.        ,
        21.        ,  23.        ,   0.99833333,   0.995     ,
         0.99833333,   0.99833333,   0.375     ,   0.1875    ,
         0.4375    ,   0.6875    ,  -0.3500061 ,  -1.4100037 ,
        -0.3500061 ,   1.0599976 ])

In [32]:
XX=[]
for t in X:
    #print(idx)
    
    tmp = np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) )
    tmp = np.append(tmp,np.percentile(t,10,axis=0))
    tmp = np.append(tmp,np.percentile(t,50,axis=0))
    tmp = np.append(tmp,np.percentile(t,90,axis=0))
    test = t
    
    try:
        taa=test[:,0]
    except TypeError:
        taa=[test[0][0]]
    valid_time = np.zeros_like(taa)
    valid_time[0] = taa[0]
    for n in xrange(1,len(taa)):
        valid_time[n] = taa[n] - taa[n-1]
    valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
    valid_time = valid_time / 60.0

    
    sum=0
    try:
        column_ref=test[:,2]
    except TypeError:
        column_ref=[test[0][2]]
    for dbz, hours in zip(column_ref, valid_time):
        # See: https://en.wikipedia.org/wiki/DBZ_(meteorology)
        if np.isfinite(dbz):
            mmperhr = pow(pow(10, dbz/10)/200, 0.625)
            sum = sum + mmperhr * hours
            
    #XX.append(np.append(np.array(sum),tmp))
    #XX.append(np.array([sum]))
    XX.append(tmp)

In [16]:
XX[0]


Out[16]:
array([  4.48764871e-01,   3.04444444e+01,   2.00000000e+00,
         1.43888889e+01,   1.21666667e+01,   1.55555556e+01,
         1.92222222e+01,   2.08333333e+01,   1.86666667e+01,
         2.13888889e+01,   2.44444444e+01,   9.97962960e-01,
         9.94629637e-01,   9.98333330e-01,   9.99074079e-01,
         4.23611111e-01,   1.66666667e-01,   4.51388889e-01,
         7.98611111e-01,  -2.34449602e-01,  -1.44889326e+00,
        -3.12227033e-01,   1.05888367e+00,   5.20000000e+01,
         0.00000000e+00,   7.00000000e+00,   9.50000000e+00,
         1.05000000e+01,   1.30000000e+01,   1.30000000e+01,
         1.30000000e+01,   1.00000000e+01,   6.00000000e+00,
         0.00000000e+00,  -6.66663000e-03,   0.00000000e+00,
         0.00000000e+00,   5.00000000e-01,   7.50000000e-01,
         6.25000000e-01,   5.00000000e-01,  -1.41000370e+00,
         0.00000000e+00,   0.00000000e+00,   1.05999740e+00])

In [263]:
XX[2]


Out[263]:
array([ 0.59982168])
(X[0][1:] - X[0][:-1]).sum(0) (t[1:] - t[:-1]).sum(0) %%time etreg = etreg.fit(XX,y)

In [33]:
def splitTrainTest(X, y, split=0.2):
    tmp1, tmp2 = [], []
    ps = int(len(X) * (1-split))
    index_shuf = range(len(X))
    random.shuffle(index_shuf)
    for i in index_shuf:
        tmp1.append(X[i])
        tmp2.append(y[i])
    return tmp1[:ps], tmp2[:ps], tmp1[ps:], tmp2[ps:]

In [34]:
X_train,y_train, X_test, y_test = splitTrainTest(XX,y)


In [225]:
from sklearn import svm

In [283]:
svr = svm.SVR(C=100000)

In [284]:
%%time
srv = svr.fit(X_train,y_train)


CPU times: user 1.16 s, sys: 9.79 ms, total: 1.17 s
Wall time: 1.18 s

In [240]:
err = (svr.predict(X_test)-y_test)**2
err.sum()/len(err)


Out[240]:
68.94416640518908

In [285]:
err = (svr.predict(X_test)-y_test)**2
err.sum()/len(err)


Out[285]:
748.02998632366246

In [241]:
%%time
svr_score = cross_val_score(svr, XX, y, cv=5)
print("Score: %s\tMean: %.03f"%(svr_score,svr_score.mean()))


Score: [ 0.00178716 -0.01275117  0.00367807 -0.01291184 -0.00668543]	Mean: -0.005
CPU times: user 1.76 s, sys: 27.7 ms, total: 1.79 s
Wall time: 1.8 s


In [367]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn import grid_search

In [378]:
knn = KNeighborsRegressor(n_neighbors=6,weights='distance')

In [ ]:
parameters = {'weights':('distance','uniform'),'algorithm':('auto', 'ball_tree', 'kd_tree', 'brute')}
grid_knn = grid_search.GridSearchCV(knn, parameters,n_jobs=2)

In [ ]:
%%time
grid_knn.fit(X_train,y_train)

In [ ]:
print(grid_knn.grid_scores_)
print("Best: ",grid_knn.best_params_)

In [375]:
knn = grid_knn.best_estimator_

In [379]:
knn.fit(X_train,y_train)


Out[379]:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_neighbors=6, p=2, weights='distance')

In [380]:
err = (knn.predict(X_test)-y_test)**2
err.sum()/len(err)


Out[380]:
777.3484598901299


In [35]:
etreg = ExtraTreesRegressor(n_estimators=200, max_depth=None, min_samples_split=1, random_state=0)

In [69]:
parameters = {'n_estimators':range(100,200,20)}
grid_rf = grid_search.GridSearchCV(etreg, parameters,n_jobs=2)

In [70]:
%%time
grid_rf.fit(X_train,y_train)


CPU times: user 5.58 s, sys: 190 ms, total: 5.77 s
Wall time: 18 s
Out[70]:
GridSearchCV(cv=None, error_score='raise',
       estimator=ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
          min_samples_split=1, min_weight_fraction_leaf=0.0,
          n_estimators=100, n_jobs=1, oob_score=False, random_state=0,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=2,
       param_grid={'n_estimators': [100, 120, 140, 160, 180]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [91]:
print(grid_rf.grid_scores_)
print("Best: ",grid_rf.best_params_)


[mean: 0.29327, std: 0.30796, params: {'n_estimators': 100}, mean: 0.30028, std: 0.30506, params: {'n_estimators': 120}, mean: 0.30427, std: 0.30391, params: {'n_estimators': 140}, mean: 0.30245, std: 0.30611, params: {'n_estimators': 160}, mean: 0.30627, std: 0.30341, params: {'n_estimators': 180}]
Best {'n_estimators': 180}

In [86]:
grid_rf.best_params_


Out[86]:
{'n_estimators': 180}

In [289]:
es = etreg
#es = grid_rf.best_estimator_

In [290]:
%%time
es = es.fit(X_train,y_train)


CPU times: user 18 s, sys: 110 ms, total: 18.1 s
Wall time: 18.1 s

In [291]:
err = (es.predict(X_test)-y_test)**2
err.sum()/len(err)


Out[291]:
749.35726462569858


In [296]:
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD

in_dim = len(XX[0])
out_dim = 1  

model = Sequential()
# Dense(64) is a fully-connected layer with 64 hidden units.
# in the first layer, you must specify the expected input data shape:
# here, 20-dimensional vectors.
model.add(Dense(128, input_shape=(in_dim,)))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(1, init='uniform'))
model.add(Activation('linear'))

#sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
#model.compile(loss='mean_squared_error', optimizer=sgd)

rms = RMSprop()
model.compile(loss='mean_squared_error', optimizer=rms)

#model.fit(X_train, y_train, nb_epoch=20, batch_size=16)
#score = model.evaluate(X_test, y_test, batch_size=16)

In [309]:
prep = []
for i in y_train:
    prep.append(min(i,20))

In [327]:
prep=np.array(prep)
mi,ma = prep.min(),prep.max()
fy = (prep-mi) / (ma-mi)
#my = fy.max()
#fy = fy/fy.max()

In [329]:
model.fit(np.array(X_train), fy, batch_size=10, nb_epoch=10, validation_split=0.1)


Train on 2226 samples, validate on 248 samples
Epoch 1/10
2226/2226 [==============================] - 0s - loss: 0.0460 - val_loss: 0.0492
Epoch 2/10
2226/2226 [==============================] - 0s - loss: 0.0462 - val_loss: 0.0510
Epoch 3/10
2226/2226 [==============================] - 0s - loss: 0.0470 - val_loss: 0.0485
Epoch 4/10
2226/2226 [==============================] - 0s - loss: 0.0462 - val_loss: 0.0477
Epoch 5/10
2226/2226 [==============================] - 0s - loss: 0.0463 - val_loss: 0.0490
Epoch 6/10
2226/2226 [==============================] - 0s - loss: 0.0455 - val_loss: 0.0495
Epoch 7/10
2226/2226 [==============================] - 0s - loss: 0.0461 - val_loss: 0.0501
Epoch 8/10
2226/2226 [==============================] - 0s - loss: 0.0451 - val_loss: 0.0554
Epoch 9/10
2226/2226 [==============================] - 0s - loss: 0.0455 - val_loss: 0.0475
Epoch 10/10
2226/2226 [==============================] - 0s - loss: 0.0444 - val_loss: 0.0496
Out[329]:
<keras.callbacks.History at 0x11c6ab6d0>

In [335]:
pred = model.predict(np.array(X_test))*ma+mi

In [336]:
err = (pred-y_test)**2
err.sum()/len(err)


Out[336]:
584504.6506384426

In [357]:
r = random.randrange(len(X_train))
print("(Train) Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_train[r]]))[0][0]*ma+mi,y_train[r]))

r = random.randrange(len(X_test))
print("(Test)  Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_test[r]]))[0][0]*ma+mi,y_test[r]))


(Train) Prediction 1.8161, True: 0.2540
(Test)  Prediction 2.5971, True: 2.5400


In [38]:
def marshall_palmer(ref, minutes_past):
    #print("Estimating rainfall from {0} observations".format(len(minutes_past)))
    # how long is each observation valid?
    valid_time = np.zeros_like(minutes_past)
    valid_time[0] = minutes_past.iloc[0]
    for n in xrange(1, len(minutes_past)):
        valid_time[n] = minutes_past.iloc[n] - minutes_past.iloc[n-1]
    valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
    valid_time = valid_time / 60.0

    # sum up rainrate * validtime
    sum = 0
    for dbz, hours in zip(ref, valid_time):
        # See: https://en.wikipedia.org/wiki/DBZ_(meteorology)
        if np.isfinite(dbz):
            mmperhr = pow(pow(10, dbz/10)/200, 0.625)
            sum = sum + mmperhr * hours
    return sum


def simplesum(ref,hour):
    hour.sum()

# each unique Id is an hour of data at some gauge
def myfunc(hour):
    #rowid = hour['Id'].iloc[0]
    # sort hour by minutes_past
    hour = hour.sort('minutes_past', ascending=True)
    est = marshall_palmer(hour['Ref'], hour['minutes_past'])
    return est

In [140]:
info = raw.groupby(raw.index)

In [39]:
estimates = raw.groupby(raw.index).apply(myfunc)
estimates.head(20)


Out[39]:
Id
1     0.000000
2     0.630814
3     0.000000
4     2.717599
5     0.000000
6     0.000000
7     0.040640
8     0.701645
9     0.000000
10    5.103416
11    0.599822
12    4.399598
13    0.000000
14    0.245304
15    0.390208
17    0.906720
18    0.353654
19    0.202986
20    0.000000
21    0.000000
dtype: float64

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [113]:
%%time
etreg.fit(X_train,y_train)


CPU times: user 1.74 s, sys: 19.9 ms, total: 1.76 s
Wall time: 1.77 s
Out[113]:
ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
          min_samples_split=1, min_weight_fraction_leaf=0.0,
          n_estimators=100, n_jobs=1, oob_score=False, random_state=0,
          verbose=0, warm_start=False)

In [96]:
%%time
et_score = cross_val_score(etreg, XX, y, cv=5)
print("Score: %s\tMean: %.03f"%(et_score,et_score.mean()))


Score: [ 0.51407557  0.15938088  0.69138431  0.09297001  0.00505057]	Mean: 0.293
CPU times: user 4.46 s, sys: 70.5 ms, total: 4.54 s
Wall time: 4.55 s

In [114]:
%%time
et_score = cross_val_score(etreg, XX, y, cv=5)
print("Score: %s\tMean: %.03f"%(et_score,et_score.mean()))


Score: [ 0.53941779  0.1639901   0.68162502  0.12919835  0.00245396]	Mean: 0.303
CPU times: user 8.64 s, sys: 93.9 ms, total: 8.73 s
Wall time: 8.76 s

In [97]:
err = (etreg.predict(X_test)-y_test)**2
err.sum()/len(err)


Out[97]:
216.45597252585003

In [115]:
err = (etreg.predict(X_test)-y_test)**2
err.sum()/len(err)


Out[115]:
207.96284975354561

In [41]:
r = random.randrange(len(X_train))
print(r)
print(etreg.predict(X_train[r]))
print(y_train[r])

r = random.randrange(len(X_test))
print(r)
print(etreg.predict(X_test[r]))
print(y_test[r])


1767
[ 4.8260026]
4.8260026


In [13]:
%%time
#filename = "data/reduced_test_5000.csv"
filename = "data/test.csv"
test = pd.read_csv(filename)
test = test.set_index('Id')


CPU times: user 20.9 s, sys: 5.06 s, total: 25.9 s
Wall time: 26.8 s

In [14]:
features_columns = list([u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])

def getX(raw):
    selected_columns = list([ u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
    
    data = raw[selected_columns]
    
    docX= []
    for i in data.index.unique():
        if isinstance(data.loc[i],pd.core.series.Series):
            m = [data.loc[i].as_matrix()]
            docX.append(m)
        else:
            m = data.loc[i].as_matrix()
            docX.append(m)
    X = np.array(docX)
    return X

In [74]:
X=getX(test)

tmp = []
for i in X:
    tmp.append(len(i))
tmp = np.array(tmp)
sns.countplot(tmp,order=range(tmp.min(),tmp.max()+1))
plt.title("Number of ID per number of observations\n(On test dataset)")
plt.plot()


Out[74]:
[]

In [15]:
testFull = test.dropna()

In [16]:
%%time
X=getX(testFull)  # 1min
XX = [np.array(t).mean(0) for t in X]  # 10s


CPU times: user 1min 12s, sys: 860 ms, total: 1min 13s
Wall time: 1min 14s

In [17]:
pd.DataFrame(etreg.predict(XX)).describe()


Out[17]:
0
count 235515.000000
mean 6.904250
std 7.382986
min 0.229492
25% 2.670935
50% 4.726779
75% 8.646297
max 344.055538

In [18]:
predFull = zip(testFull.index.unique(),etreg.predict(XX))

In [36]:
testNan = test.drop(test[features_columns].dropna(how='all').index)

In [37]:
tmp = np.empty(len(testNan))
tmp.fill(0.445000)   # 50th percentile of full Nan dataset
predNan = zip(testNan.index.unique(),tmp)

In [39]:
testLeft = test.drop(testNan.index.unique()).drop(testFull.index.unique())

In [40]:
tmp = np.empty(len(testLeft))
tmp.fill(1.27)   # 50th percentile of full Nan dataset
predLeft = zip(testLeft.index.unique(),tmp)

In [41]:
len(testFull.index.unique())


Out[41]:
235515

In [42]:
len(testNan.index.unique())


Out[42]:
232148

In [43]:
len(testLeft.index.unique())


Out[43]:
249962

In [44]:
pred = predFull + predNan + predLeft

In [45]:
pred.sort(key=lambda x: x[0], reverse=False)

In [46]:
submission = pd.DataFrame(pred)
submission.columns = ["Id","Expected"]
submission.head()


Out[46]:
Id Expected
0 1 1.270000
1 2 1.270000
2 3 8.287113
3 4 9.608633
4 5 0.445000

In [47]:
submission.to_csv("first_submit.csv",index=False)

In [ ]:


In [34]:
filename = "data/sample_solution.csv"
sol = pd.read_csv(filename)

In [28]:
ss = np.array(sol)

In [30]:
%%time
for a,b in predFull:
    ss[a-1][1]=b


CPU times: user 361 ms, sys: 28.7 ms, total: 390 ms
Wall time: 369 ms

In [41]:
ss


Out[41]:
array([[  1.00000000e+00,   8.57645300e-02],
       [  2.00000000e+00,   1.20000000e+01],
       [  3.00000000e+00,   8.28711271e+00],
       ..., 
       [  7.17623000e+05,   9.65720687e+00],
       [  7.17624000e+05,   1.29702251e+00],
       [  7.17625000e+05,   0.00000000e+00]])

In [45]:
sub = pd.DataFrame(ss)
sub.columns = ["Id","Expected"]
sub.Id = sub.Id.astype(int)
sub.head()


Out[45]:
Id Expected
0 1 0.085765
1 2 12.000000
2 3 8.287113
3 4 9.608633
4 5 0.000000

In [46]:
sub.to_csv("submit_2.csv",index=False)

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: