TalkingData -- Feature Engineering

In the third round of feature engineering, we construct features on app usage. This includes three types of features. First, simple counts on the number of installed and active apps. Second, a GMM-analysis of crosstab features. Third, bag of installed apps together with embeddings provided by a neural network.


In [1]:
import math
import numpy as np
import pandas as pd
import pickle
import time

from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers.core import Dense, Layer, Dropout, Activation
from keras.optimizers import SGD

from scipy import sparse
from scipy.sparse import coo_matrix
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cross_validation import train_test_split, StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import log_loss 
from sklearn.mixture import GMM
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

#minimum number of times an app is used so that it will be counted
MIN_COUNT = 10

#paths to data and features
DATA_PATH = "../../../input/"
FEATURE_PATH = "../../../features/"

#number of classes to be predicted
NCLASSES = 12

#seed for randomness
SEED = 1747
np.random.seed(SEED)

#number of clusters for GMM
NCLUST = 2

########################
##BLENDING_CONFIG
########################

#number of folds used in blending
N_FOLDS = 4

########################
##NEURAL NETWORK CONFIGS
########################

#batch size for neural network
BATCH_SIZE = 64

#size of the hidden layer in the neural network
HIDDEN_SIZE =  50

#probability for dropout
DROPOUT_PROB = 50

#Number of neueral networks used for bagging
N_NNETS = 4


Using Theano backend.

Next, we load the apps data and join with the device data. We also perform a more concise encoding of ids.


In [2]:
device_apps = pickle.load(open('{0}device_apps_inner'.format(DATA_PATH),'rb'))
device_apps = device_apps.dropna()
train_indices = pickle.load(open('{0}train_event_ids'.format(DATA_PATH),'rb'))
test_indices = pickle.load(open('{0}test_event_ids'.format(DATA_PATH),'rb'))
train_test = pd.concat([pd.Series(train_indices).to_frame(),pd.Series(test_indices).to_frame()])
apps = pd.read_csv('{0}app_labels.csv'.format(DATA_PATH))
apps['app_id'] = apps['app_id'].astype(float)

did_enc = LabelEncoder().fit(train_test['device_id'])
train_test['device_id'] = did_enc.transform(train_test['device_id']).astype(np.int32)
train_indices = did_enc.transform(train_indices)
test_indices = did_enc.transform(test_indices)
device_apps['device_id'] = did_enc.transform(device_apps['device_id']).astype(np.int32)

app_enc = LabelEncoder().fit(np.hstack([device_apps['app_id'], apps['app_id']]))
device_apps['app_id'] = app_enc.transform(device_apps['app_id']).astype(np.int32)
apps['app_id'] = app_enc.transform(apps['app_id']).astype(np.int32)

start = time.clock()
device_apps_inner = train_test.merge(device_apps, 'left',   on = 'device_id')

device_label = device_apps_inner[['device_id', 'app_id']].merge(apps, 'left', on = 'app_id')[['device_id', 'label_id']]
device_label_small = device_label.groupby(['device_id', 'label_id'], sort = False).size().reset_index()[['device_id', 'label_id']]

print(time.clock() - start)
del apps
del device_apps


18.453177000000004

We filter out rarely used apps.


In [3]:
apps_train = device_apps_inner[device_apps_inner['device_id'].isin(train_indices)]['app_id']
apps_test = device_apps_inner[device_apps_inner['device_id'].isin(test_indices)]['app_id']

train_counts = apps_train.value_counts()
test_counts = apps_test.value_counts()

relevant_train = train_counts[train_counts > MIN_COUNT].index
relevant_test = test_counts[test_counts > MIN_COUNT].index
relevant_apps = np.intersect1d(relevant_train, relevant_test)

device_apps_filtered = device_apps_inner[device_apps_inner['app_id'].isin(relevant_apps)][['device_id', 'app_id']]

Total app counts (2 features)

As a first feature, we count the numbers of installed and active apps.


In [4]:
app_cnt = device_apps_inner.groupby('device_id', sort =False)[['is_installed','is_active']].sum()
app_cnt_names = ['installed_cnt', 'active_cnt']
del device_apps_inner

Bags of features (3892 + 473 features)

Next, we introduce a function to create a tf-idf vectorized bag of features...


In [4]:
def make_bag(data, index, feature_name, col_prefix):
    bag_of_features_raw = data.groupby(index, as_index = False, sort = False).aggregate(lambda x:  
                                                                                      ' '.join([str(word) for word in list(x)]))
    bag_of_features  = train_test.merge(bag_of_features_raw, 'left', on = 'device_id').drop_duplicates().fillna('')
    
    vectorizer = TfidfVectorizer(analyzer = "word", tokenizer = None,  min_df=3,  max_features=None, use_idf=1,
                                 smooth_idf=1, sublinear_tf=1, preprocessor = None, stop_words = None)
    vectorized_bof = vectorizer.fit_transform(bag_of_features[feature_name])
    
    names = ['{0}{1}'.format(col_prefix, name) for name in vectorizer.get_feature_names()]
    return (names, vectorized_bof)

... and apply it to the installed apps.


In [5]:
start = time.clock()
(boa_names, bag_of_apps) = make_bag(device_apps_filtered, 'device_id', 'app_id', 'boa')
print(time.clock()- start)
(bol_names, bag_of_labels) = make_bag(device_label, 'device_id', 'label_id', 'bol')
print(time.clock()- start)

bags = sparse.csr_matrix(sparse.hstack([bag_of_apps, bag_of_labels]))
bags_names = np.hstack([boa_names, bol_names])


26.194353
163.55957999999998

Neural network embedding (48 features)

We use a Keras to train a single-layer neural network and keep the hidden layer as feature for later classifiers. The architecture is taken from https://www.kaggle.com/chechir/talkingdata-mobile-user-demographics/keras-on-labels-and-brands.

As preliminary step, we fetch the training data, filter for devices with events and extract the labels.

Data preparation


In [6]:
train = pd.read_csv('{0}gender_age_train.csv'.format(DATA_PATH)).loc[:, ['device_id', 'group']]
train_event = train[train['device_id'].isin(did_enc.classes_)]
train_event['device_id'] = did_enc.transform(train_event['device_id'])
labels = LabelEncoder().fit_transform(train_event['group'])


-c:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

To prevent overfitting, we use a blend-type split of the training data.


In [7]:
def blend_split(X, y, n_folds):
    #first, generate holdout set
    X_train, X_ho, y_train, y_ho = train_test_split(X, y, stratify = y, train_size = 0.8, random_state = SEED)
    
    #second, split the remaining set
    
    ind_train, ind_test, _, _ = train_test_split(range(len(y_train)), y_train, stratify = y_train, train_size = 0.5, 
                                               random_state = SEED)
    return [StratifiedKFold(y_train, n_folds = n_folds, shuffle = True, random_state = SEED), X_train, X_ho, y_train, y_ho]

We split the training indices and the bagged features.


In [8]:
ind_split = blend_split(train_indices,labels, N_FOLDS)
bags_split = blend_split(bags[:labels.shape[0]], labels, N_FOLDS)

Neural network auxiliary functions

Since the bag-of-features is a sparse matrix, we need a generator to feed it into Keras.


In [9]:
def batches(X):
    idxs = range(X.shape[0])
    return [idxs[i:i + BATCH_SIZE] for i in range(0, len(idxs), BATCH_SIZE)]  
    
def sparse_gen(X, y):
    y_enc = OneHotEncoder().fit_transform(y.reshape(-1,1)).toarray()
    while 1:
        batch_list = batches(X)
        for batch in batch_list:
            yield (X[batch, :].toarray(), y_enc[batch, :])
            
def batch_predict(clf, X):
    batch_list = batches(X)
    return np.vstack([clf.predict(X[batch,:].toarray()) for batch in batch_list])

As suggested in https://www.kaggle.com/chechir/talkingdata-mobile-user-demographics/keras-on-labels-and-brands, a neural network with a single hidden layer plus dropout.


In [10]:
def baseline_model():
    model = Sequential()
    model.add(Dense(HIDDEN_SIZE, input_dim=bags.shape[1], init = 'normal', activation = 'tanh'))
    model.add(Dropout(DROPOUT_PROB))
    model.add(Dense(NCLASSES, init = 'normal', activation = 'sigmoid'))
    model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    return model

Feature generation for training data

Next, we fit the neural networks on one of the blending halves and generate the hidden layer when performing predictions on the second half. In order to prepare bagging, we fit two networks with different initial data.


In [11]:
def keras_fit_folds(data, nb_epoch):
    [skf, X_train, X_val, y_train, y_val] = data
    result_nnets = []
    for train_index, test_index in skf:        
            nnets = keras_fit(X_train[train_index], X_train[test_index], y_train[train_index], y_train[test_index], 
                              N_NNETS, nb_epoch)
            result_nnets = result_nnets + [nnets]
    return result_nnets

def keras_fit(X_train, X_val, y_train,  y_val, bag_num, nb_epoch):
    result_nnets = []
    for _ in range(bag_num):
        nnet = baseline_model()
        nnet.fit_generator(sparse_gen(X_train, y_train),
                           samples_per_epoch = X_train.shape[0], 
                           validation_data = sparse_gen(X_val, y_val),
                           callbacks = [EarlyStopping(monitor='val_loss', patience=0, verbose=1)],
                           nb_val_samples = X_val.shape[0],
                           nb_epoch = nb_epoch,
                           verbose = 2)
        result_nnets += [nnet]
    return result_nnets

For predictions on the training set, we fit on the folds. For predictions on the test set, we fit on the entire training set.


In [12]:
nnets_train = keras_fit_folds(bags_split, nb_epoch = 6)
nnets_test = keras_fit(bags_split[1], bags_split[2], bags_split[3], bags_split[4], N_NNETS, nb_epoch = 5)


Epoch 1/6
44s - loss: 2.3449 - acc: 0.1687 - val_loss: 2.2232 - val_acc: 0.2038
Epoch 2/6
45s - loss: 2.1255 - acc: 0.2454 - val_loss: 2.0746 - val_acc: 0.2689
Epoch 3/6
45s - loss: 1.9784 - acc: 0.3086 - val_loss: 2.0052 - val_acc: 0.3030
Epoch 4/6
45s - loss: 1.8753 - acc: 0.3507 - val_loss: 1.9771 - val_acc: 0.3107
Epoch 5/6
45s - loss: 1.7919 - acc: 0.3845 - val_loss: 1.9699 - val_acc: 0.3145
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7154 - acc: 0.4175 - val_loss: 1.9753 - val_acc: 0.3098
Epoch 1/6
45s - loss: 2.3454 - acc: 0.1626 - val_loss: 2.2229 - val_acc: 0.2048
Epoch 2/6
45s - loss: 2.1241 - acc: 0.2456 - val_loss: 2.0711 - val_acc: 0.2700
Epoch 3/6
45s - loss: 1.9758 - acc: 0.3089 - val_loss: 2.0027 - val_acc: 0.3070
Epoch 4/6
45s - loss: 1.8764 - acc: 0.3480 - val_loss: 1.9771 - val_acc: 0.3126
Epoch 5/6
45s - loss: 1.7977 - acc: 0.3791 - val_loss: 1.9711 - val_acc: 0.3118
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7253 - acc: 0.4084 - val_loss: 1.9763 - val_acc: 0.3070
Epoch 1/6
46s - loss: 2.3531 - acc: 0.1634 - val_loss: 2.2346 - val_acc: 0.2027
Epoch 2/6
45s - loss: 2.1327 - acc: 0.2418 - val_loss: 2.0778 - val_acc: 0.2766
Epoch 3/6
46s - loss: 1.9810 - acc: 0.3044 - val_loss: 2.0068 - val_acc: 0.3028
Epoch 4/6
45s - loss: 1.8756 - acc: 0.3480 - val_loss: 1.9780 - val_acc: 0.3120
Epoch 5/6
45s - loss: 1.7898 - acc: 0.3807 - val_loss: 1.9701 - val_acc: 0.3130
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7104 - acc: 0.4137 - val_loss: 1.9751 - val_acc: 0.3096
Epoch 1/6
46s - loss: 2.3515 - acc: 0.1614 - val_loss: 2.2341 - val_acc: 0.2016
Epoch 2/6
45s - loss: 2.1321 - acc: 0.2398 - val_loss: 2.0744 - val_acc: 0.2674
Epoch 3/6
45s - loss: 1.9778 - acc: 0.3083 - val_loss: 2.0049 - val_acc: 0.3000
Epoch 4/6
46s - loss: 1.8736 - acc: 0.3504 - val_loss: 1.9784 - val_acc: 0.3088
Epoch 5/6
45s - loss: 1.7893 - acc: 0.3841 - val_loss: 1.9723 - val_acc: 0.3077
Epoch 6/6
Epoch 00005: early stopping
44s - loss: 1.7118 - acc: 0.4160 - val_loss: 1.9788 - val_acc: 0.3038
Epoch 1/6
45s - loss: 2.3389 - acc: 0.1667 - val_loss: 2.2215 - val_acc: 0.1952
Epoch 2/6
44s - loss: 2.1143 - acc: 0.2468 - val_loss: 2.0857 - val_acc: 0.2565
Epoch 3/6
45s - loss: 1.9686 - acc: 0.3103 - val_loss: 2.0211 - val_acc: 0.2949
Epoch 4/6
44s - loss: 1.8677 - acc: 0.3537 - val_loss: 1.9960 - val_acc: 0.3074
Epoch 5/6
45s - loss: 1.7859 - acc: 0.3855 - val_loss: 1.9911 - val_acc: 0.3115
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7098 - acc: 0.4191 - val_loss: 1.9987 - val_acc: 0.3085
Epoch 1/6
46s - loss: 2.3391 - acc: 0.1653 - val_loss: 2.2218 - val_acc: 0.1898
Epoch 2/6
46s - loss: 2.1180 - acc: 0.2377 - val_loss: 2.0904 - val_acc: 0.2484
Epoch 3/6
45s - loss: 1.9736 - acc: 0.3043 - val_loss: 2.0213 - val_acc: 0.2887
Epoch 4/6
45s - loss: 1.8696 - acc: 0.3474 - val_loss: 1.9936 - val_acc: 0.3091
Epoch 5/6
45s - loss: 1.7881 - acc: 0.3763 - val_loss: 1.9882 - val_acc: 0.3108
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7143 - acc: 0.4082 - val_loss: 1.9953 - val_acc: 0.3106
Epoch 1/6
46s - loss: 2.3363 - acc: 0.1765 - val_loss: 2.2176 - val_acc: 0.2070
Epoch 2/6
45s - loss: 2.1153 - acc: 0.2462 - val_loss: 2.0916 - val_acc: 0.2477
Epoch 3/6
45s - loss: 1.9722 - acc: 0.3074 - val_loss: 2.0253 - val_acc: 0.2900
Epoch 4/6
45s - loss: 1.8675 - acc: 0.3510 - val_loss: 1.9988 - val_acc: 0.3061
Epoch 5/6
45s - loss: 1.7833 - acc: 0.3830 - val_loss: 1.9941 - val_acc: 0.3091
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7053 - acc: 0.4190 - val_loss: 2.0023 - val_acc: 0.3080
Epoch 1/6
45s - loss: 2.3433 - acc: 0.1725 - val_loss: 2.2266 - val_acc: 0.2042
Epoch 2/6
45s - loss: 2.1184 - acc: 0.2514 - val_loss: 2.0857 - val_acc: 0.2589
Epoch 3/6
45s - loss: 1.9652 - acc: 0.3161 - val_loss: 2.0164 - val_acc: 0.3001
Epoch 4/6
45s - loss: 1.8587 - acc: 0.3541 - val_loss: 1.9929 - val_acc: 0.3089
Epoch 5/6
44s - loss: 1.7740 - acc: 0.3883 - val_loss: 1.9908 - val_acc: 0.3102
Epoch 6/6
Epoch 00005: early stopping
44s - loss: 1.6966 - acc: 0.4200 - val_loss: 2.0014 - val_acc: 0.3061
Epoch 1/6
45s - loss: 2.3435 - acc: 0.1629 - val_loss: 2.2213 - val_acc: 0.2103
Epoch 2/6
45s - loss: 2.1158 - acc: 0.2521 - val_loss: 2.0759 - val_acc: 0.2734
Epoch 3/6
44s - loss: 1.9650 - acc: 0.3123 - val_loss: 2.0136 - val_acc: 0.2970
Epoch 4/6
45s - loss: 1.8639 - acc: 0.3522 - val_loss: 1.9904 - val_acc: 0.3071
Epoch 5/6
45s - loss: 1.7818 - acc: 0.3861 - val_loss: 1.9853 - val_acc: 0.3122
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7052 - acc: 0.4192 - val_loss: 1.9918 - val_acc: 0.3079
Epoch 1/6
45s - loss: 2.3414 - acc: 0.1712 - val_loss: 2.2221 - val_acc: 0.2088
Epoch 2/6
45s - loss: 2.1235 - acc: 0.2440 - val_loss: 2.0875 - val_acc: 0.2672
Epoch 3/6
45s - loss: 1.9801 - acc: 0.3074 - val_loss: 2.0203 - val_acc: 0.2953
Epoch 4/6
45s - loss: 1.8753 - acc: 0.3473 - val_loss: 1.9922 - val_acc: 0.3047
Epoch 5/6
45s - loss: 1.7915 - acc: 0.3811 - val_loss: 1.9844 - val_acc: 0.3103
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7148 - acc: 0.4162 - val_loss: 1.9890 - val_acc: 0.3075
Epoch 1/6
45s - loss: 2.3445 - acc: 0.1690 - val_loss: 2.2269 - val_acc: 0.2041
Epoch 2/6
45s - loss: 2.1217 - acc: 0.2478 - val_loss: 2.0815 - val_acc: 0.2723
Epoch 3/6
45s - loss: 1.9722 - acc: 0.3139 - val_loss: 2.0145 - val_acc: 0.2989
Epoch 4/6
45s - loss: 1.8695 - acc: 0.3510 - val_loss: 1.9887 - val_acc: 0.3090
Epoch 5/6
45s - loss: 1.7865 - acc: 0.3816 - val_loss: 1.9818 - val_acc: 0.3129
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7095 - acc: 0.4155 - val_loss: 1.9870 - val_acc: 0.3109
Epoch 1/6
45s - loss: 2.3432 - acc: 0.1643 - val_loss: 2.2273 - val_acc: 0.2049
Epoch 2/6
44s - loss: 2.1213 - acc: 0.2486 - val_loss: 2.0827 - val_acc: 0.2734
Epoch 3/6
44s - loss: 1.9729 - acc: 0.3104 - val_loss: 2.0167 - val_acc: 0.2966
Epoch 4/6
45s - loss: 1.8692 - acc: 0.3500 - val_loss: 1.9904 - val_acc: 0.3073
Epoch 5/6
44s - loss: 1.7843 - acc: 0.3824 - val_loss: 1.9833 - val_acc: 0.3157
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7054 - acc: 0.4175 - val_loss: 1.9886 - val_acc: 0.3137
Epoch 1/6
45s - loss: 2.3417 - acc: 0.1668 - val_loss: 2.2173 - val_acc: 0.1982
Epoch 2/6
44s - loss: 2.1230 - acc: 0.2432 - val_loss: 2.0774 - val_acc: 0.2574
Epoch 3/6
45s - loss: 1.9744 - acc: 0.3116 - val_loss: 2.0172 - val_acc: 0.3006
Epoch 4/6
45s - loss: 1.8681 - acc: 0.3519 - val_loss: 1.9966 - val_acc: 0.3023
Epoch 5/6
45s - loss: 1.7849 - acc: 0.3818 - val_loss: 1.9932 - val_acc: 0.3079
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7092 - acc: 0.4134 - val_loss: 1.9999 - val_acc: 0.3057
Epoch 1/6
45s - loss: 2.3507 - acc: 0.1676 - val_loss: 2.2294 - val_acc: 0.2044
Epoch 2/6
45s - loss: 2.1297 - acc: 0.2503 - val_loss: 2.0774 - val_acc: 0.2656
Epoch 3/6
45s - loss: 1.9729 - acc: 0.3107 - val_loss: 2.0181 - val_acc: 0.3016
Epoch 4/6
45s - loss: 1.8687 - acc: 0.3513 - val_loss: 1.9994 - val_acc: 0.3079
Epoch 5/6
45s - loss: 1.7862 - acc: 0.3815 - val_loss: 1.9964 - val_acc: 0.3059
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7093 - acc: 0.4163 - val_loss: 2.0035 - val_acc: 0.3070
Epoch 1/6
46s - loss: 2.3495 - acc: 0.1612 - val_loss: 2.2272 - val_acc: 0.2007
Epoch 2/6
45s - loss: 2.1247 - acc: 0.2536 - val_loss: 2.0714 - val_acc: 0.2699
Epoch 3/6
45s - loss: 1.9680 - acc: 0.3132 - val_loss: 2.0125 - val_acc: 0.3021
Epoch 4/6
45s - loss: 1.8635 - acc: 0.3505 - val_loss: 1.9927 - val_acc: 0.3102
Epoch 5/6
45s - loss: 1.7816 - acc: 0.3829 - val_loss: 1.9893 - val_acc: 0.3085
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7069 - acc: 0.4119 - val_loss: 1.9962 - val_acc: 0.3057
Epoch 1/6
46s - loss: 2.3463 - acc: 0.1708 - val_loss: 2.2230 - val_acc: 0.2048
Epoch 2/6
45s - loss: 2.1296 - acc: 0.2498 - val_loss: 2.0805 - val_acc: 0.2623
Epoch 3/6
45s - loss: 1.9786 - acc: 0.3109 - val_loss: 2.0180 - val_acc: 0.3012
Epoch 4/6
45s - loss: 1.8709 - acc: 0.3503 - val_loss: 1.9972 - val_acc: 0.3074
Epoch 5/6
45s - loss: 1.7861 - acc: 0.3825 - val_loss: 1.9937 - val_acc: 0.3083
Epoch 6/6
Epoch 00005: early stopping
45s - loss: 1.7088 - acc: 0.4148 - val_loss: 2.0008 - val_acc: 0.3064
Epoch 1/5
68s - loss: 2.3071 - acc: 0.1814 - val_loss: 2.1719 - val_acc: 0.2237
Epoch 2/5
67s - loss: 2.0601 - acc: 0.2779 - val_loss: 2.0414 - val_acc: 0.2758
Epoch 3/5
67s - loss: 1.9256 - acc: 0.3275 - val_loss: 1.9928 - val_acc: 0.2947
Epoch 4/5
67s - loss: 1.8396 - acc: 0.3612 - val_loss: 1.9753 - val_acc: 0.3016
Epoch 5/5
67s - loss: 1.7676 - acc: 0.3895 - val_loss: 1.9737 - val_acc: 0.3065
Epoch 1/5
67s - loss: 2.3043 - acc: 0.1783 - val_loss: 2.1679 - val_acc: 0.2096
Epoch 2/5
67s - loss: 2.0578 - acc: 0.2733 - val_loss: 2.0405 - val_acc: 0.2726
Epoch 3/5
67s - loss: 1.9244 - acc: 0.3286 - val_loss: 1.9936 - val_acc: 0.2891
Epoch 4/5
66s - loss: 1.8386 - acc: 0.3595 - val_loss: 1.9768 - val_acc: 0.2988
Epoch 5/5
67s - loss: 1.7667 - acc: 0.3878 - val_loss: 1.9759 - val_acc: 0.3024
Epoch 1/5
68s - loss: 2.3078 - acc: 0.1852 - val_loss: 2.1736 - val_acc: 0.2211
Epoch 2/5
67s - loss: 2.0565 - acc: 0.2797 - val_loss: 2.0387 - val_acc: 0.2797
Epoch 3/5
67s - loss: 1.9181 - acc: 0.3307 - val_loss: 1.9904 - val_acc: 0.2962
Epoch 4/5
67s - loss: 1.8318 - acc: 0.3595 - val_loss: 1.9746 - val_acc: 0.3018
Epoch 5/5
67s - loss: 1.7601 - acc: 0.3892 - val_loss: 1.9744 - val_acc: 0.3048
Epoch 1/5
68s - loss: 2.3134 - acc: 0.1785 - val_loss: 2.1783 - val_acc: 0.2166
Epoch 2/5
67s - loss: 2.0659 - acc: 0.2725 - val_loss: 2.0434 - val_acc: 0.2780
Epoch 3/5
67s - loss: 1.9244 - acc: 0.3282 - val_loss: 1.9910 - val_acc: 0.2939
Epoch 4/5
67s - loss: 1.8349 - acc: 0.3578 - val_loss: 1.9740 - val_acc: 0.3012
Epoch 5/5
68s - loss: 1.7616 - acc: 0.3884 - val_loss: 1.9732 - val_acc: 0.3076

Now, we compute the predictions on the training set and the test set.


In [26]:
#############TRAINING PREDICTIONS##################

preds_train = []
train_columns = ['NN{0}'.format(col_idx) for col_idx in range(N_NNETS * NCLASSES)] 

for fold, (train_index, test_index) in enumerate(bags_split[0]):
    train_data = np.hstack([batch_predict(nnet, bags_split[1][test_index]) for nnet in nnets_train[fold]])
    #np.mean(np.array([batch_predict(nnet, bags_split[1][test_index]) for nnet in nnets_train[fold]]), axis = 0)    
    #np.hstack([batch_predict(net, bags_split[1][test_index]) for net in nnets_train[0]])
    ttrain_indices = pd.Series(ind_split[1][test_index], name = 'device_id')
    preds_train = preds_train + [pd.DataFrame(train_data, columns = train_columns, index = ttrain_indices)]
train_predictions = pd.concat(preds_train, axis = 0)

For the predictions on the holdout set, we can generate the maximum number of predictions in each fold.


In [33]:
#############TEST PREDICTIONS##################

ho_test_input = sparse.csr_matrix(sparse.vstack([bags_split[2], bags[labels.shape[0]:]]))
ho_test_data = np.hstack([batch_predict(net, ho_test_input) for net in nnets_test])
#np.mean(np.array([batch_predict(net, ho_test_input) for net in nnets_test]))
ho_test_index = pd.Series(np.hstack([ind_split[2], train_test['device_id'][labels.shape[0]:]]), name = 'device_id')
ho_test_predictions = pd.DataFrame(ho_test_data , columns = train_columns, index = ho_test_index)

It remains to merge the predictions.


In [34]:
nnet_features = train_test.merge(pd.concat([train_predictions, ho_test_predictions]),
                 'left', left_on = 'device_id', right_index = True).set_index('device_id')
nnet_names = train_columns

In [35]:
pickle.dump(nnet_features.values, open('{0}nnet_features2.p'.format(FEATURE_PATH), 'wb'))
pickle.dump(nnet_names, open('{0}nnet_names.p'.format(FEATURE_PATH), 'wb'))

Crosstab embedding of app and label features

Our goal is to embed the bags of apps and labels into a lower dimensional space. In addition to the neural network considered above, we use an ad hoc approach, where each app is encoded by its histogram. For large bags, this would lead to a large number of necessary encodings. Therefore, we use GMM to get the NCLUST most important components. As first step, we use the crosstab-encoder. In order to avoid distortion of validation scores, the crosstabs are only computed outside of the validation set.


In [84]:
class CrossTabEncoder(BaseEstimator, TransformerMixin):
    """CrossTabEncoder
    A CrossTabEncoder characterizes a feature by its crosstab dataframe.
    """
    def fit(self, data, ids_list):
        """For each class of the considered feature, the empirical histogram for the prediction classes is computed. 
        
        Parameters
        ----------
        data : feature column used for the histogram computation
        ids_list : list of ids used to split the training ids
        """        
        

        self.ids_pair = ids_list
        merged_data = [train_event[train_event['device_id'].isin(ids)].merge(data,
                                                'inner', 'device_id').drop('device_id', axis = 1) for ids in ids_list]
        data_total = pd.concat(merged_data, axis = 0)
        
        self.crosstabs = [pd.crosstab(mdata.iloc[:, 1], mdata.iloc[:, 0]).fillna(0).apply(compute_log_probs,axis=1)
                          for mdata in merged_data]
        self.crosstab_total = pd.crosstab(data_total.iloc[:, 1], data_total.iloc[:, 0]).fillna(0).apply(compute_log_probs, axis = 1) 
        
        return self

    def transform(self, data):
        """The precomputed histograms are joined as features to the given data set.
        
        Parameters
        ----------
        data : data that will be augmented by the crosstab feature
        
        Returns
        -------
        Transformed dataset.
        """
        feat_name = data.columns[1]
        
        #indices that are in neither of the trained stacking halves
        residual = pd.Index(train_test['device_id']).difference(ids_pair[0]).difference(ids_pair[1]).values
        
        #merging the crosstab features with the device data
        device_ct = pd.concat([merge_crosstab(data, feat_name, crosstab, ids) for crosstab, ids in 
                 [[self.crosstabs[1], self.ids_pair[0]],[self.crosstabs[0], self.ids_pair[1]]]], axis = 0)
        device_ct_total = data[data['device_id'].isin(residual)].merge(self.crosstab_total, 'left', 
                                     left_on = feat_name, right_index = True).fillna(0).drop(feat_name, axis = 1)  
        
        #combined_device = device_ct.combine_first(device_ct_total)
        combined_device = pd.concat([device_ct, device_ct_total], axis = 0)
        combined_device['device_id'] = combined_device['device_id'].astype(int)
        return  combined_device

def compute_log_probs(row):    
    """
    helper function for computing regularized log probabilities
    """
    row = row + np.ones(len(row))
    row_sum = row.sum()
    return (row/row_sum).apply(lambda y: math.log(y) - math.log(1.0/NCLASSES))

def merge_crosstab(data, feat_name, crosstab, ids):
    """
    helper function to join crosstab features with data
    """
    ct_data = data.merge(crosstab,  'left',   left_on = feat_name, right_index = True).drop(feat_name, axis = 1)
    ct_data = ct_data[ct_data['device_id'].isin(ids)]
    
    return ct_data

We define a function to compute the embeddings and join to the device data.


In [119]:
def embed(data,  feat_name):
    ct = CrossTabEncoder().fit(data, ind_split[:2])
    device_ct_combine = ct.transform(data).dropna()    
    group_sizes = device_ct_combine.iloc[:,0:2].groupby('device_id').transform(len)
    return [device_ct_combine.iloc[(group_sizes.values >= NCLUST).ravel(), :].set_index('device_id'),ct]

We merge the filtered set with the apps data and compute the crosstab features.


In [120]:
start = time.clock()
[device_apps_ct, ct_apps] = embed(device_apps_filtered,  'app_id')
print(time.clock()- start)
[device_labels_ct, ct_labels] = embed(device_label_small, 'label_id')
print(time.clock()- start)


119.31499900000006
226.03547300000002

GMM for embedding (124 Features)

As features, we consider two geometric manipulations of the bag of embedded vectors. First, the mean of the embedded vectors and second the GMM. To achieve this, we define a function to compute the means and GMMs of the app and label embeddings for each device-id.


In [104]:
def fit_gmm(data):
    #fitting of gmm
    gmm = GMM(N_CLUST, random_state = SEED)
    gmm.fit(data)
    
    #column names
    mean_cols = ['Mean{0}'.format(i) for i in data.columns]
    gmm_mean_cols = ['GMM-Mean{0}'.format(i) for i in range(len(gmm.means_.ravel()))]
    gmm_weight_cols = ['GMM-Weight{0}'.format(i) for i in range(len(gmm.weights_))]
    gmm_covar_cols = ['GMM-Covars{0}'.format(i) for i in range(len(gmm.covars_.ravel()))]
    
    gmm_features = np.hstack([data.mean(), gmm.means_.ravel(), gmm.weights_, gmm.covars_.ravel()])
    gmm_index = np.hstack([mean_cols, gmm_mean_cols, gmm_weight_cols, gmm_covar_cols])
    
    return pd.Series( gmm_features, index = index_km)

Now, we compute the function on the data and join back to the original set of train and test ids.


In [133]:
start = time.clock()
gmm_apps = device_apps_ct.groupby(level = 0).apply(fit_gmm)
print(time.clock()- start)
gmm_labels = device_labels_ct.groupby(level = 0).apply(fit_gmm)
print(time.clock()- start)

gmm_apps_feature = train_test.merge(gmm_apps, 'left', left_on = 'device_id', right_index = True).set_index('device_id')
gmm_labels_feature = train_test.merge(gmm_labels, 'left', left_on = 'device_id', right_index = True).set_index('device_id')


1839.775029
3787.1200769999996

Finally, we stack the GMM-features together.


In [134]:
gmm = np.hstack([gmm_apps_feature, gmm_labels_feature])
gmm_names = np.hstack([['App-{0}'.format(col) for col in gmm_apps_feature], ['Label-{0}'.format(col) for col in gmm_apps_feature]])

Concluding feature union (4620 features)

Finally, everything is pickled.


In [138]:
app_features = sparse.csr_matrix(sparse.hstack([app_cnt, gmm, bags]))
app_features_names = np.hstack([app_cnt_names, gmm_names, bags_names])

pickle.dump(app_features, open('{0}app_features.p'.format(FEATURE_PATH), 'wb'))
pickle.dump(app_features_names, open('{0}app_features_names.p'.format(FEATURE_PATH), 'wb'))