FDMS TME3

Florian Toque & Paul Willot



In [27]:

    
# -*- coding: utf-8 -*-

Notes

We tried different regressor models (GBR, SVM, MLP, Random Forest and KNN), as recommended by the winning team of the Kaggle taxi-trajectory competition. So far GBR seems to be the best, slightly better than the RF.
The new features we extracted had only a small impact on the predictions, but they improved them consistently.
We tried to use an LSTM to take advantage of the sequential structure of the data, but it did not work well, probably because there is not enough data (13M lines divided by the average sequence length (15), minus the 30% of fully empty data).



In [1]:

    
# from __future__ import exam_success
from __future__ import absolute_import 
from __future__ import print_function  

# Standard imports
%matplotlib inline
import os
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import random
import pandas as pd
import scipy.stats as stats

# Sk cheats
from sklearn.cross_validation import cross_val_score
from sklearn import grid_search
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
#from sklearn.preprocessing import Imputer   # get rid of nan
from sklearn.decomposition import NMF        # to add features based on the latent representation
from sklearn.decomposition import ProjectedGradientNMF

# Faster gradient boosting
import xgboost as xgb

# For neural networks models
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation 
from keras.optimizers import SGD, RMSprop









    



/Library/Python/2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

13.765.202 lines in train.csv
8.022.757 lines in test.csv



In [12]:

    
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import nltk
import re
from nltk.stem import WordNetLemmatizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import grid_search
from sklearn.linear_model import LogisticRegression

A few words about the dataset

Predictions are made for the corn-growing states of the USA (mainly Iowa, Illinois and Indiana) during the season with the highest rainfall (as illustrated by Iowa for the months of April to August).

The Kaggle page indicates that the dataset has been shuffled, so working on a subset seems acceptable.
The test set is not extracted from the same data as the training set, however, which makes the evaluation trickier.

Load the dataset



In [3]:

    
%%time
#filename = "data/train.csv"
filename = "data/train.json"
#filename = "data/reduced_train_1000000.csv"
raw = pd.read_json(filename)
#raw = raw.set_index('Id')









    



CPU times: user 250 ms, sys: 32.8 ms, total: 283 ms
Wall time: 289 ms



In [22]:

    
traindf = raw
lemmatizer = WordNetLemmatizer()  # create the lemmatizer once and reuse it

def to_document(ingredient_list):
    # keep letters only, lemmatize each ingredient string, then join them into one document
    return ' '.join([lemmatizer.lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in ingredient_list]).strip()

traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]
traindf['ingredients_string'] = [to_document(lists) for lists in traindf['ingredients']]

# note: this reads data/train.json again, so testdf holds the training recipes
testdf = pd.read_json("data/train.json")
testdf['ingredients_clean_string'] = [' , '.join(z).strip() for z in testdf['ingredients']]
testdf['ingredients_string'] = [to_document(lists) for lists in testdf['ingredients']]



corpustr = traindf['ingredients_string']
vectorizertr = TfidfVectorizer(stop_words='english',
                             ngram_range = ( 1 , 1 ),analyzer="word", 
                             max_df = .57 , binary=False , token_pattern=r'\w+' , sublinear_tf=False)

tfidftr=vectorizertr.fit_transform(corpustr).todense()

#
corpusts = testdf['ingredients_string']

vectorizerts = TfidfVectorizer(stop_words='english')

#
tfidfts=vectorizertr.transform(corpusts)

predictors_tr = tfidftr

targets_tr = traindf['cuisine']

#
predictors_ts = tfidfts
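
The cell above fits the TF-IDF vocabulary on the training recipes only and reuses it (`vectorizertr.transform`) for the second frame. As a rough illustration of the text-cleaning step that comes before it, the sketch below turns one recipe into a single string using only the standard library. The helper name `recipe_to_document` and the sample recipe are hypothetical, and lemmatization is left out because it needs the NLTK WordNet data.

```python
import re

def recipe_to_document(ingredients):
    """Turn a list of ingredient strings into one space-separated document."""
    # Replace every non-letter character with a space, as the notebook does
    letters_only = [re.sub('[^A-Za-z]', ' ', item) for item in ingredients]
    # Split and re-join to collapse the runs of spaces left by the substitution
    return ' '.join(' '.join(letters_only).split())

sample = ['romaine lettuce', '2% milk', 'feta cheese crumbles']
print(recipe_to_document(sample))
```

Collapsing the extra spaces is a cosmetic difference from the notebook, which only strips the ends; a word-level tokenizer such as the `\w+` pattern used above yields the same tokens either way.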



In [122]:

    
raw.head()









    Out[122]:






  
    
      
   cuisine      id     ingredients                                        ingredients_clean_string                           ingredients_string
0  greek        10259  [romaine lettuce, black olives, grape tomatoes...  romaine lettuce , black olives , grape tomatoe...  romaine lettuce black olives grape tomatoes ga...
1  southern_us  25693  [plain flour, ground pepper, salt, tomatoes, g...  plain flour , ground pepper , salt , tomatoes ...  plain flour ground pepper salt tomato ground b...
2  filipino     20130  [eggs, pepper, salt, mayonaise, cooking oil, g...  eggs , pepper , salt , mayonaise , cooking oil...  egg pepper salt mayonaise cooking oil green ch...
3  indian       22213  [water, vegetable oil, wheat, salt]                water , vegetable oil , wheat , salt               water vegetable oil wheat salt
4  indian       13162  [black pepper, shallots, cornflour, cayenne pe...  black pepper , shallots , cornflour , cayenne ...  black pepper shallot cornflour cayenne pepper ...



In [35]:

    
tfidftr[0]









    Out[35]:





matrix([[ 0.,  0.,  0., ...,  0.,  0.,  0.]])



In [18]:

    
targets_tr[:5]









    Out[18]:





0          greek
1    southern_us
2       filipino
3         indian
4         indian
Name: cuisine, dtype: object



In [36]:

    
labels = pd.get_dummies(targets_tr)
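
`pd.get_dummies` turns the cuisine column into one indicator column per class. A NumPy-only sketch of the same idea is shown below. The five sample labels are copied from `targets_tr[:5]`; the variable names are illustrative, and only four columns appear because the sample contains four distinct cuisines.

```python
import numpy as np

cuisines = np.array(['greek', 'southern_us', 'filipino', 'indian', 'indian'])

# Sorted unique class names, plus the class index of every sample
classes, class_index = np.unique(cuisines, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(classes), dtype=int)[class_index]

print(classes)
print(one_hot)
```

Taking `classes[one_hot.argmax(axis=1)]` recovers the original labels, which is how class probabilities from a network can be mapped back to cuisine names.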



In [37]:

    
labels[:5]









    Out[37]:






  
    
      
One-hot DataFrame with 20 columns:
brazilian, british, cajun_creole, chinese, filipino, french, greek, indian, irish, italian,
jamaican, japanese, korean, mexican, moroccan, russian, southern_us, spanish, thai, vietnamese

Each row holds a single 1 (all other columns are 0):

row  column set to 1
0    greek
1    southern_us
2    filipino
3    indian
4    indian



In [ ]:



In [29]:

    
classifier = LogisticRegression()
classifier=classifier.fit(predictors_tr,targets_tr)



In [30]:

    
predictions=classifier.predict(predictors_ts)
testdf['cuisine'] = predictions
testdf = testdf.sort_values(by='id', ascending=True)  # DataFrame.sort is deprecated

#testdf[['id' , 'ingredients_clean_string' , 'cuisine' ]].to_csv("submission.csv")



In [ ]:



In [ ]:

    
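%%time
# 'cuisine' is a categorical target, so use the classifier variant of gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
classifier = GradientBoostingClassifier()
classifier=classifier.fit(predictors_tr,targets_tr)

predictions=classifier.predict(predictors_ts)
testdf['cuisine'] = predictions
testdf = testdf.sort_values(by='id', ascending=True)

testdf[['id' , 'ingredients_clean_string' , 'cuisine' ]].to_csv("submission.csv")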



In [6]:

    
raw.head()









    Out[6]:






  
    
      
   cuisine      id     ingredients
0  greek        10259  [romaine lettuce, black olives, grape tomatoes...
1  southern_us  25693  [plain flour, ground pepper, salt, tomatoes, g...
2  filipino     20130  [eggs, pepper, salt, mayonaise, cooking oil, g...
3  indian       22213  [water, vegetable oil, wheat, salt]
4  indian       13162  [black pepper, shallots, cornflour, cayenne pe...



In [8]:

    
raw['ingredients'][0]









    Out[8]:





[u'romaine lettuce',
 u'black olives',
 u'grape tomatoes',
 u'garlic',
 u'pepper',
 u'purple onion',
 u'seasoning',
 u'garbanzo beans',
 u'feta cheese crumbles']

Gradient Boosting Regressor



In [141]:

    
# the dbz feature does not influence xgbr so much
xgbr = xgb.XGBRegressor(max_depth=6, learning_rate=0.1, n_estimators=700, silent=True,
                        objective='reg:linear', nthread=-1, gamma=0, min_child_weight=1,
                        max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
                        reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5,
                        seed=0, missing=None)



In [142]:

    
%%time
xgbr = xgbr.fit(X_train,y_train)









    



CPU times: user 1min 3s, sys: 1.75 s, total: 1min 5s
Wall time: 20.4 s



In [119]:

    
# without the nmf features
# print(xgbr.score(X_train,y_train))
## 0.993948231144
# print(xgbr.score(X_test,y_test))
## 0.613931733332









    



0.993948231144
0.613931733332



In [143]:

    
# with nmf features
print(xgbr.score(X_train,y_train))
print(xgbr.score(X_test,y_test))









    



0.999679564454
0.691887473333
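
For scikit-learn style regressors such as `XGBRegressor`, `score` returns the coefficient of determination R². The gap between roughly 1.00 on the training split and 0.69 on the test split therefore suggests substantial overfitting. A minimal sketch of the metric follows; the helper name `r2` and the toy values are illustrative only.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

print(r2([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r2([1, 2, 3], [2, 2, 2]))  # always predicting the mean
```

A model that always predicts the mean of the targets scores 0, so a test score of 0.69 is still well above that baseline.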

Here for legacy



In [ ]:

    
# tfidftr, labels



In [46]:

    
np.shape(labels)[1]









    Out[46]:





20



In [108]:

    
#from keras.models import Sequential
#from keras.layers.core import Dense, Dropout, Activation
#from keras.optimizers import SGD

in_dim = np.shape(tfidftr)[1]
out_dim = np.shape(labels)[1]

model = Sequential()
# Dense(128) is a fully-connected layer with 128 hidden units.
# In the first layer, you must specify the expected input data shape:
# here, in_dim-dimensional TF-IDF vectors (one dimension per vocabulary term).
model.add(Dense(128, input_shape=(in_dim,)))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(out_dim, init='uniform'))
model.add(Activation('softmax'))


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)

#model.fit(X_train, y_train, nb_epoch=20, batch_size=16)
#score = model.evaluate(X_test, y_test, batch_size=16)



In [120]:

    
np.count_nonzero(tfidftr[4])









    Out[120]:





31



In [124]:

    
tfidftr[0]









    Out[124]:





matrix([[ 0.,  0.,  0., ...,  0.,  0.,  0.]])



In [126]:

    
labels









    Out[126]:






  
    
      
One-hot DataFrame with 20 columns:
brazilian, british, cajun_creole, chinese, filipino, french, greek, indian, irish, italian,
jamaican, japanese, korean, mexican, moroccan, russian, southern_us, spanish, thai, vietnamese

Each row holds a single 1 (all other columns are 0):

row    column set to 1
0      greek
1      southern_us
2      filipino
3      indian
4      indian
5      jamaican
6      spanish
7      italian
8      mexican
9      italian
10     italian
11     chinese
12     italian
13     mexican
14     italian
15     indian
16     british
17     italian
18     thai
19     vietnamese
20     thai
21     mexican
22     southern_us
23     chinese
24     italian
25     chinese
26     cajun_creole
27     italian
28     chinese
29     mexican
...    ...
39744  greek
39745  spanish
39746  indian
39747  moroccan
39748  italian
39749  mexican
39750  mexican
39751  moroccan
39752  southern_us
39753  italian
39754  vietnamese
39755  indian
39756  mexican
39757  greek
39758  greek
39759  korean
39760  southern_us
39761  chinese
39762  indian
39763  italian
39764  mexican
39765  indian
39766  irish
39767  italian
39768  mexican
39769  irish
39770  italian
39771  irish
39772  chinese
39773  mexican

39774 rows × 20 columns



In [109]:

    
model.fit(tfidftr, np.zeros(len(tfidftr)), nb_epoch=20, batch_size=16) #np.zeros(len(tfidftr))









    



Epoch 1/20






    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-109-37b0ec80c430> in <module>()
----> 1 model.fit(tfidftr, np.zeros(len(tfidftr)), nb_epoch=20, batch_size=16) #np.zeros(len(tfidftr))

/Library/Python/2.7/site-packages/keras/models.pyc in fit(self, X, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, show_accuracy, class_weight, sample_weight)
    487                          verbose=verbose, callbacks=callbacks,
    488                          val_f=val_f, val_ins=val_ins,
--> 489                          shuffle=shuffle, metrics=metrics)
    490 
    491     def predict(self, X, batch_size=128, verbose=0):

/Library/Python/2.7/site-packages/keras/models.pyc in _fit(self, f, ins, out_labels, batch_size, nb_epoch, verbose, callbacks, val_f, val_ins, shuffle, metrics)
    208                 batch_logs['size'] = len(batch_ids)
    209                 callbacks.on_batch_begin(batch_index, batch_logs)
--> 210                 outs = f(*ins_batch)
    211                 if type(outs) != list:
    212                     outs = [outs]

/Library/Python/2.7/site-packages/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    869                     node=self.fn.nodes[self.fn.position_of_error],
    870                     thunk=thunk,
--> 871                     storage_map=getattr(self.fn, 'storage_map', None))
    872             else:
    873                 # old-style linkers raise their own exceptions

/Library/Python/2.7/site-packages/theano/gof/link.pyc in raise_with_op(node, thunk, exc_info, storage_map)
    312         # extra long error message in that case.
    313         pass
--> 314     reraise(exc_type, exc_value, exc_trace)
    315 
    316 

/Library/Python/2.7/site-packages/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    857         t0_fn = time.time()
    858         try:
--> 859             outputs = self.fn()
    860         except Exception:
    861             if hasattr(self.fn, 'position_of_error'):

ValueError: Input dimension mis-match. (input[0].shape[1] = 20, input[1].shape[1] = 1)
Apply node that caused the error: Elemwise{Sub}[(0, 0)](AdvancedSubtensor1.0, AdvancedSubtensor1.0)
Toposort index: 44
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(16, 20), (16, 1)]
Inputs strides: [(160, 8), (8, 8)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Elemwise{Composite{((i0 * i1 * i2) / i3)}}(TensorConstant{(1, 1) of 2.0}, InplaceDimShuffle{0,x}.0, Elemwise{Sub}[(0, 0)].0, Elemwise{mul,no_inplace}.0), Elemwise{Sqr}[(0, 0)](Elemwise{Sub}[(0, 0)].0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
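
The `ValueError` above comes from the target shape. The network ends in a 20-unit softmax, so each batch of predictions has shape `(16, 20)`, while `np.zeros(len(tfidftr))` supplies one value per sample, i.e. `(16, 1)` per batch. The targets need one column per class. The sketch below shows the shape check in NumPy only; `check_targets` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def check_targets(targets, n_samples, n_classes):
    """Return targets as a float array, or raise if the shape is not (n_samples, n_classes)."""
    targets = np.asarray(targets, dtype=float)
    expected = (n_samples, n_classes)
    if targets.shape != expected:
        raise ValueError("expected targets of shape %s, got %s" % (expected, targets.shape))
    return targets

n_samples, n_classes = 16, 20
bad_targets = np.zeros(n_samples)                 # shape (16,): one scalar per sample
good_targets = np.zeros((n_samples, n_classes))   # shape (16, 20): one column per class

check_targets(good_targets, n_samples, n_classes)  # passes silently
try:
    check_targets(bad_targets, n_samples, n_classes)
except ValueError as err:
    print(err)
```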



In [56]:

    
in_dim = np.shape(tfidftr)[1]
out_dim = np.shape(labels)[1]

model = Sequential()
# Dense(128) is a fully-connected layer with 128 hidden units.
# In the first layer, you must specify the expected input data shape:
# here, in_dim-dimensional TF-IDF vectors (one dimension per vocabulary term).
model.add(Dense(128, input_shape=(in_dim,)))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(out_dim, init='uniform'))
model.add(Activation('softmax'))

#sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
#model.compile(loss='mean_squared_error', optimizer=sgd)

rms = RMSprop()
model.compile(loss='mean_squared_error', optimizer=rms)

#model.fit(X_train, y_train, nb_epoch=20, batch_size=16)
#score = model.evaluate(X_test, y_test, batch_size=16)



In [54]:

    
print(np.shape(tfidftr))
print(np.shape(labels))









    



(39774, 2963)
(39774, 20)



In [58]:

    
model.fit(tfidftr,labels)









    



Epoch 1/100






    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-58-f325939e525c> in <module>()
----> 1 model.fit(tfidftr,labels)

/Library/Python/2.7/site-packages/keras/models.pyc in fit(self, X, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, show_accuracy, class_weight, sample_weight)
    487                          verbose=verbose, callbacks=callbacks,
    488                          val_f=val_f, val_ins=val_ins,
--> 489                          shuffle=shuffle, metrics=metrics)
    490 
    491     def predict(self, X, batch_size=128, verbose=0):

/Library/Python/2.7/site-packages/keras/models.pyc in _fit(self, f, ins, out_labels, batch_size, nb_epoch, verbose, callbacks, val_f, val_ins, shuffle, metrics)
    199                 batch_ids = index_array[batch_start:batch_end]
    200                 try:
--> 201                     ins_batch = slice_X(ins, batch_ids)
    202                 except TypeError as err:
    203                     raise Exception('TypeError while preparing batch. \

/Library/Python/2.7/site-packages/keras/models.pyc in slice_X(X, start, stop)
     53     if type(X) == list:
     54         if hasattr(start, '__len__'):
---> 55             return [x[start] for x in X]
     56         else:
     57             return [x[start:stop] for x in X]

/Library/Python/2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1961         if isinstance(key, (Series, np.ndarray, Index, list)):
   1962             # either boolean or fancy integer index
-> 1963             return self._getitem_array(key)
   1964         elif isinstance(key, DataFrame):
   1965             return self._getitem_frame(key)

/Library/Python/2.7/site-packages/pandas/core/frame.pyc in _getitem_array(self, key)
   2006         else:
   2007             indexer = self.ix._convert_to_indexer(key, axis=1)
-> 2008             return self.take(indexer, axis=1, convert=True)
   2009 
   2010     def _getitem_multilevel(self, key):

/Library/Python/2.7/site-packages/pandas/core/generic.pyc in take(self, indices, axis, convert, is_copy)
   1369         new_data = self._data.take(indices,
   1370                                    axis=self._get_block_manager_axis(axis),
-> 1371                                    convert=True, verify=True)
   1372         result = self._constructor(new_data).__finalize__(self)
   1373 

/Library/Python/2.7/site-packages/pandas/core/internals.pyc in take(self, indexer, axis, verify, convert)
   3617         n = self.shape[axis]
   3618         if convert:
-> 3619             indexer = maybe_convert_indices(indexer, n)
   3620 
   3621         if verify:

/Library/Python/2.7/site-packages/pandas/core/indexing.pyc in maybe_convert_indices(indices, n)
   1748     mask = (indices >= n) | (indices < 0)
   1749     if mask.any():
-> 1750         raise IndexError("indices are out-of-bounds")
   1751     return indices
   1752 

IndexError: indices are out-of-bounds
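
The `IndexError` arises because `labels` is a pandas DataFrame. Keras slices each batch with an integer array (`x[start]` in `slice_X`), and indexing a DataFrame with `[]` selects columns, not rows. Converting the labels to a NumPy array first avoids this. Below is a small sketch with made-up data; the variable names are illustrative, and it assumes a pandas version that provides `DataFrame.to_numpy` (on older versions `labels.values` plays the same role).

```python
import numpy as np
import pandas as pd

targets = pd.Series(['greek', 'indian', 'greek', 'thai'])
labels_df = pd.get_dummies(targets)

# Plain ndarray: integer-array indexing now selects rows (samples)
labels_array = labels_df.to_numpy(dtype='float32')

batch_ids = np.array([0, 2])
batch = labels_array[batch_ids]
print(batch.shape)
```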



In [42]:

    
prep = []
for i in y_train:
    prep.append(min(i,20))



In [43]:

    
prep=np.array(prep)
mi,ma = prep.min(),prep.max()
fy = (prep-mi) / (ma-mi)
#my = fy.max()
#fy = fy/fy.max()
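
The cell above applies min-max scaling, `fy = (prep - mi) / (ma - mi)`, which maps the clipped targets to [0, 1]. Predictions made in the scaled space map back with the inverse, `y = fy * (ma - mi) + mi`. A round-trip sketch on toy values (the numbers are made up):

```python
import numpy as np

raw_targets = np.array([0.5, 3.0, 12.0, 45.0])
clipped = np.minimum(raw_targets, 20)   # same clipping as min(i, 20)

mi, ma = clipped.min(), clipped.max()
scaled = (clipped - mi) / (ma - mi)     # values in [0, 1]
restored = scaled * (ma - mi) + mi      # inverse transform

print(scaled)
print(restored)
```

Multiplying by `ma` alone only gives the same result when `mi` is exactly 0.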



In [44]:

    
model.fit(np.array(X_train), fy, batch_size=10, nb_epoch=10, validation_split=0.1)









    



Train on 2224 samples, validate on 248 samples
Epoch 1/10
2224/2224 [==============================] - 0s - loss: 0.0628 - val_loss: 0.0506
Epoch 2/10
2224/2224 [==============================] - 0s - loss: 0.0537 - val_loss: 0.0509
Epoch 3/10
2224/2224 [==============================] - 0s - loss: 0.0517 - val_loss: 0.0541
Epoch 4/10
2224/2224 [==============================] - 0s - loss: 0.0520 - val_loss: 0.0500
Epoch 5/10
2224/2224 [==============================] - 0s - loss: 0.0519 - val_loss: 0.0525
Epoch 6/10
2224/2224 [==============================] - 0s - loss: 0.0523 - val_loss: 0.0489
Epoch 7/10
2224/2224 [==============================] - 0s - loss: 0.0517 - val_loss: 0.0551
Epoch 8/10
2224/2224 [==============================] - 0s - loss: 0.0512 - val_loss: 0.0498
Epoch 9/10
2224/2224 [==============================] - 0s - loss: 0.0516 - val_loss: 0.0583
Epoch 10/10
2224/2224 [==============================] - 0s - loss: 0.0516 - val_loss: 0.0497






    Out[44]:





<keras.callbacks.History at 0x112430c10>



In [45]:

    
pred = model.predict(np.array(X_test))*(ma-mi)+mi  # invert the min-max scaling applied to fy



In [46]:

    
err = (pred-y_test)**2
err.sum()/len(err)









    Out[46]:





182460.82171163053



In [ ]:

    
r = random.randrange(len(X_train))
print("(Train) Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_train[r]]))[0][0]*(ma-mi)+mi,y_train[r]))

r = random.randrange(len(X_test))
print("(Test)  Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_test[r]]))[0][0]*(ma-mi)+mi,y_test[r]))

Predict on testset



In [338]:

    
%%time
filename = "data/reduced_test_5000.csv"
#filename = "data/test.csv"
test = pd.read_csv(filename)
test = test.set_index('Id')









    



CPU times: user 10.3 ms, sys: 2.9 ms, total: 13.2 ms
Wall time: 12.4 ms



In [339]:

    
features_columns = list([u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])

def getX(raw):
    selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
    
    data = raw[selected_columns]
    
    docX= []
    for i in data.index.unique():
        if isinstance(data.loc[i],pd.core.series.Series):
            m = [data.loc[i].as_matrix()]
            docX.append(m)
        else:
            m = data.loc[i].as_matrix()
            docX.append(m)
    X = np.array(docX)
    return X
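
The loop in `getX` calls `data.loc[i]` once per Id, which can be slow on the full test file. One possible alternative, sketched here under the assumption that the frame is indexed by `Id` as above, is to let pandas group the rows. The helper name `group_rows_by_id` and the toy frame are illustrative only.

```python
import numpy as np
import pandas as pd

def group_rows_by_id(data):
    """Return one 2-D array of observations per index value, in sorted index order."""
    # groupby(level=0) groups on the index; sort=True orders the groups by Id.
    return [group.values for _, group in data.groupby(level=0, sort=True)]

# Toy frame (hypothetical values): Id 1 has two observations, Id 2 has one.
toy = pd.DataFrame(
    {"minutes_past": [3, 16, 25], "Ref": [10.0, np.nan, 15.5]},
    index=pd.Index([1, 1, 2], name="Id"),
)
groups = group_rows_by_id(toy)
print([g.shape for g in groups])
```

Unlike `data.index.unique()`, which keeps the order of first appearance, this returns groups sorted by Id, so any list of Ids paired with the result should be sorted the same way.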



In [340]:

    
#%%time
#X=getX(test)

#tmp = []
#for i in X:
#    tmp.append(len(i))
#tmp = np.array(tmp)
#sns.countplot(tmp,order=range(tmp.min(),tmp.max()+1))
#plt.title("Number of ID per number of observations\n(On test dataset)")
#plt.plot()



In [341]:

    
#testFull = test.dropna()
testNoFullNan = test.loc[test[features_columns].dropna(how='all').index.unique()]



In [342]:

    
%%time
X=getX(testNoFullNan)  # 1min
#XX = [np.array(t).mean(0) for t in X]  # 10s









    



CPU times: user 107 ms, sys: 2.27 ms, total: 109 ms
Wall time: 110 ms



In [343]:

    
XX=[]
for t in X:
    nm = np.nanmean(t,0)
    for idx,j in enumerate(nm):
        if np.isnan(j):
            nm[idx]=global_means[idx]
    XX.append(nm)
XX=np.array(XX)

# shift each column so its minimum is 0 (NMF requires non-negative input)
XX_rescaled=XX[:,:]-np.min(XX,0)
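
The per-Id averaging above can also be written without the inner loop. This sketch assumes `global_means` is a 1-D array of per-column fallback values (it is defined earlier in the notebook; the numbers below are made up), and `mean_features` is an illustrative name. The shift uses the test-set minimum, as in the cell above, which may differ from the shift applied to the training data before `nmf` was fitted.

```python
import warnings
import numpy as np

def mean_features(sequences, global_means):
    """Average each sequence over time, replacing all-NaN columns with fallbacks."""
    rows = []
    for seq in sequences:
        with warnings.catch_warnings():
            # nanmean warns on columns that are entirely NaN; the NaN result is handled below.
            warnings.simplefilter("ignore", category=RuntimeWarning)
            col_means = np.nanmean(np.asarray(seq, dtype=float), axis=0)
        rows.append(np.where(np.isnan(col_means), global_means, col_means))
    return np.array(rows)

# Toy data (hypothetical values): two Ids, two feature columns.
sequences = [
    [[1.0, np.nan], [3.0, np.nan]],   # second column never observed
    [[-2.0, 4.0]],
]
global_means = np.array([0.5, 9.0])

XX = mean_features(sequences, global_means)
XX_rescaled = XX - XX.min(axis=0)     # shift each column so its minimum is 0
print(XX)
print(XX_rescaled)
```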



In [344]:

    
%%time
W = nmf.transform(XX_rescaled)









    



CPU times: user 11.7 ms, sys: 2.26 ms, total: 13.9 ms
Wall time: 13.3 ms



In [345]:

    
XX=addFeatures(X,mf=W)



In [346]:

    
pd.DataFrame(xgbr.predict(XX)).describe()



In [348]:

    
reducedModelList = [knn,etreg,xgbr,gbr]
globalPred = np.array([f.predict(XX) for f in reducedModelList]).T
predTest = globalPred.mean(1)
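
The blend above is an unweighted mean of the individual model predictions. A self-contained sketch of the same idea, using two stand-in estimators (`ConstantModel` and `SumModel` are hypothetical, not the `knn`, `etreg`, `xgbr` and `gbr` objects fitted earlier):

```python
import numpy as np

class ConstantModel(object):
    """Stand-in estimator that always predicts the same value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value, dtype=float)

class SumModel(object):
    """Stand-in estimator that predicts the sum of each row's features."""
    def predict(self, X):
        return np.asarray(X, dtype=float).sum(axis=1)

def blend(models, X):
    """Stack predictions as columns (n_samples, n_models) and average across models."""
    all_preds = np.array([m.predict(X) for m in models]).T
    return all_preds.mean(axis=1)

X_toy = np.array([[1.0, 2.0], [0.0, 4.0], [5.0, 5.0]])
blended = blend([ConstantModel(2.0), SumModel()], X_toy)
print(blended)
```

Equal weights are the simplest choice; weighting each model by its validation score would be a natural variation, but that is not what the notebook does.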



In [100]:

    
predFull = zip(testNoFullNan.index.unique(),predTest)



In [101]:

    
testNan = test.drop(test[features_columns].dropna(how='all').index)



In [ ]:

    
pred = predFull + predNan   # needs predNan from the cell below; superseded by the three-way concatenation further down



In [102]:

    
tmp = np.empty(len(testNan.index.unique()))   # one value per Id
tmp.fill(0.445000)   # 50th percentile of the all-NaN dataset
predNan = zip(testNan.index.unique(),tmp)



In [103]:

    
testLeft = test.drop(testNan.index.unique()).drop(testFull.index.unique())



In [104]:

    
tmp = np.empty(len(testLeft.index.unique()))   # one value per Id
tmp.fill(1.27)   # constant fallback for the remaining Ids (in neither testNan nor testFull)
predLeft = zip(testLeft.index.unique(),tmp)



In [105]:

    
len(testFull.index.unique())









    Out[105]:





235515



In [106]:

    
len(testNan.index.unique())









    Out[106]:





232148



In [107]:

    
len(testLeft.index.unique())









    Out[107]:





249962



In [108]:

    
pred = predFull + predNan + predLeft



In [113]:

    
pred.sort(key=lambda x: x[0], reverse=False)



In [ ]:

    
#reducedModelList = [knn,etreg,xgbr,gbr]
globalPred = np.array([f.predict(XX) for f in reducedModelList]).T
#globalPred.mean(1)



In [114]:

    
submission = pd.DataFrame(pred)
submission.columns = ["Id","Expected"]
submission.head()



In [115]:

    
submission.loc[submission['Expected']<0,'Expected'] = 0.445



In [116]:

    
submission.to_csv("submit4.csv",index=False)
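
The submission is built from three disjoint groups of Ids: model predictions, a constant 0.445 for Ids whose radar features are all missing, and a constant 1.27 for the remaining Ids. On Python 3, `zip` returns an iterator, so the list concatenation used above would need `list(zip(...))`. A minimal pandas sketch of the same assembly, with made-up Ids and predictions (the helper name `build_submission` is illustrative):

```python
import numpy as np
import pandas as pd

def build_submission(model_ids, model_preds, nan_ids, left_ids,
                     nan_value=0.445, left_value=1.27):
    """Combine per-group predictions into one frame sorted by Id."""
    parts = [
        pd.DataFrame({"Id": model_ids, "Expected": model_preds}),
        pd.DataFrame({"Id": nan_ids, "Expected": nan_value}),
        pd.DataFrame({"Id": left_ids, "Expected": left_value}),
    ]
    sub = pd.concat(parts, ignore_index=True)
    # Negative rainfall is not meaningful; fall back to the same constant as above.
    sub.loc[sub["Expected"] < 0, "Expected"] = nan_value
    sub = sub.sort_values("Id").reset_index(drop=True)
    return sub[["Id", "Expected"]]

sub = build_submission(model_ids=[3, 4], model_preds=[2.36, -0.1],
                       nan_ids=[5], left_ids=[1, 2])
print(sub)
```

Writing the result with `sub.to_csv(path, index=False)` then gives the two-column file the competition expects.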



In [73]:

    
filename = "data/sample_solution.csv"
sol = pd.read_csv(filename)



In [74]:

    
sol









    Out[74]:






  
    
      
            Id   Expected
0            1   0.085765
1            2   0.000000
2            3   1.594004
3            4   6.913380
4            5   0.000000
5            6   0.173935
6            7   3.219921
7            8   0.867394
8            9   0.000000
9           10  14.182371
10          11   0.911013
11          12   0.034835
12          13   2.733501
13          14   0.709341
14          15   0.000000
15          16   0.000000
16          17   0.000000
17          18   0.393315
18          19   0.291799
19          20   0.000000
20          21   0.000000
21          22   2.031596
22          23   2.236317
23          24   0.000000
24          25   0.000000
25          26   0.009609
26          27   0.020464
27          28   0.000000
28          29   0.000000
29          30   3.074862
...        ...        ...
717595  717596   2.573862
717596  717597   0.000000
717597  717598   6.580513
717598  717599   0.270776
717599  717600   0.177539
717600  717601   1.133367
717601  717602   0.000000
717602  717603   9.055868
717603  717604   0.633148
717604  717605  17.524065
717605  717606   0.000000
717606  717607   0.000000
717607  717608   0.000000
717608  717609  14.180502
717609  717610   1.387969
717610  717611   0.000000
717611  717612   0.527286
717612  717613   0.000000
717613  717614   0.164476
717614  717615   2.652251
717615  717616   0.302655
717616  717617   0.183060
717617  717618   0.142695
717618  717619   0.000000
717619  717620   0.343296
717620  717621   0.064034
717621  717622   0.000000
717622  717623   1.090277
717623  717624   1.297023
717624  717625   0.000000
    
  

717625 rows × 2 columns



In [ ]:

    
ss = np.array(sol)



In [ ]:

    
%%time
for a,b in predFull:
    ss[a-1][1]=b



In [ ]:

    
ss



In [75]:

    
sub = pd.DataFrame(pred)
sub.columns = ["Id","Expected"]
sub.Id = sub.Id.astype(int)
sub.head()



In [76]:

    
sub.to_csv("submit3.csv",index=False)



	0
count	270.000000
mean	39.863171
std	19.231586
min	12.168386
25%	25.052711
50%	32.528101
75%	56.126141
max	121.855644

	Id	Expected
0	1	1.270000
1	2	1.270000
2	3	2.361996
3	4	14.492731
4	5	0.445000

	Id	Expected
0	1	1.270000
1	2	1.270000
2	3	2.378660
3	4	8.851727
4	5	0.445000
