TalkingData -- Feature Importance for the Subclass of Devices with Events

In this notebook, we explore the predictive power of the features via XGBoost. We restrict ourselves to devices for which event information is available. At the end, we use the feature importance scores to keep only the features with high predictive power.


In [14]:
%matplotlib inline
import matplotlib.patches as mpatches
import matplotlib.pylab as plt
import numpy as np
import operator
import pandas as pd
import pickle
import xgboost as xgb

from scipy import sparse
from sklearn.preprocessing import LabelEncoder
from sklearn.cross_validation import train_test_split
from xgboost.sklearn import XGBClassifier

#path to data, features and models
DATA_PATH = "../../../input/"
FEATURE_PATH = "../../../features/"
CLF_PATH = "../../../models/"

#only features whose importance score is larger than IMPORTANCE_THRESH are selected
IMPORTANCE_THRESH = 1

#number of features shown in the importance plot (unused below; NBAR controls the bar chart)
FEAT_NUM_PLOT = 10

#seed for randomness
SEED = 1747
np.random.seed(SEED)

#number of bars to show in importance plot
NBAR = 30


############################################
###XGB PARAMETERS
############################################

#parameters for xgb fitting
FIT_PARAMS = {
'verbose_eval': 50, 
'early_stopping_rounds': 10,
'num_boost_round': 2000,
}

#params for xgb
HYPER_PARAMS = {
"objective": "multi:softprob",
"num_class": 12,
'eval_metric': 'mlogloss', 
}

#tuned hyperparameters for xgb
HYPER_PARAMS = dict(HYPER_PARAMS, **{
    'max_depth': 4, 'colsample_bytree': 0.35, 'subsample': 0.7, 'reg_alpha': 0.34,
    'seed': 1747, 'gamma': 2.07e-07, 'n_estimators': 300, 'nthread': 2, 'reg_lambda': 0,
    'learning_rate': 0.018, 'min_child_weight': 0.044, 'max_delta_step': 3.0})

We load the group labels and the features generated in previous notebooks.


In [11]:
train = pd.read_csv('{0}gender_age_train.csv'.format(DATA_PATH)).loc[:, ['device_id', 'group']]
test = pd.read_csv('{0}gender_age_test.csv'.format(DATA_PATH)).loc[:, ['device_id']]
train_test = pd.DataFrame(pd.concat([train, test], axis = 0)['device_id'])

pd_features = pickle.load(open('{0}phone_model_features_event.p'.format(FEATURE_PATH), 'rb'))
pd_features_names = pickle.load(open('{0}phone_model_features_names_event.p'.format(FEATURE_PATH), 'rb'))

event_features = pickle.load(open('{0}event_features_noloc.p'.format(FEATURE_PATH), 'rb'))
event_features_names = pickle.load(open('{0}event_features_noloc_names.p'.format(FEATURE_PATH), 'rb'))

bag_of_apps = pickle.load(open('{0}app_features.p'.format(FEATURE_PATH), 'rb'))
bag_of_apps_names = pickle.load(open('{0}app_features_names.p'.format(FEATURE_PATH), 'rb'))

nnet_features = pickle.load(open('{0}nnet_features2.p'.format(FEATURE_PATH), 'rb'))
nnet_features_names = pickle.load(open('{0}nnet_names.p'.format(FEATURE_PATH), 'rb'))

From the combined training and test data we select only those devices that have at least one event associated with them.


In [3]:
train_ids = pickle.load(open('{0}train_event_ids'.format(DATA_PATH), 'rb'))
test_ids  = pickle.load(open('{0}test_event_ids'.format(DATA_PATH), 'rb'))
train_test_ids = list(train_ids) + list(test_ids)
train_mask = train_test['device_id'].isin(train_test_ids)
train_test_idxs = np.where(train_mask)[0]

train_test_event = train_test.iloc[train_test_idxs, :]
labels = LabelEncoder().fit_transform(train[train_mask.values[:train.shape[0]]]['group'])
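
As a quick sanity check on the event subset, we can print how many devices carry events and confirm that the number of labels matches the training portion (a minimal sketch):


In [ ]:
#basic consistency check on the event subset
print(len(train_ids), len(test_ids))
print(train_test_event.shape)
print(labels.shape)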

The features extracted in the previous steps are stacked together and a validation set is created.


In [12]:
features = sparse.hstack([pd_features, event_features,  nnet_features,  bag_of_apps])
###LOAD FEATURES FROM PREVIOUS ITERATIONS
#features = pickle.load(open("{0}feature_sel_lvl0_events.p".format(FEATURE_PATH), 'rb'))

names = np.hstack([pd_features_names, event_features_names, nnet_features_names, bag_of_apps_names])

X_train, X_val, y_train, y_val = train_test_split(sparse.csr_matrix(features)[:len(train_ids), :], 
                                                  labels, stratify = labels, train_size = 0.8, random_state = SEED)
print(features.shape)
print(names.shape)


(58503, 4620)
(4620,)
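
Since the split is stratified, the proportions of the 12 groups should be nearly identical in the training and validation folds; a minimal check:


In [ ]:
#class proportions in the stratified training and validation folds
print(np.bincount(y_train) / float(len(y_train)))
print(np.bincount(y_val) / float(len(y_val)))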

This training data is used as input for the XGBoost classifier.


In [15]:
dtrain = xgb.DMatrix(X_train, y_train)
dvalid = xgb.DMatrix(X_val, y_val)

gbm = xgb.train(HYPER_PARAMS, dtrain, evals = [(dtrain, 'train'), (dvalid, 'eval')], **FIT_PARAMS )


Will train until eval error hasn't decreased in 10 rounds.
[0]	train-mlogloss:2.472120	eval-mlogloss:2.473637
[50]	train-mlogloss:2.097248	eval-mlogloss:2.160337
[100]	train-mlogloss:1.932289	eval-mlogloss:2.047235
[150]	train-mlogloss:1.834830	eval-mlogloss:1.996931
[200]	train-mlogloss:1.766187	eval-mlogloss:1.971833
[250]	train-mlogloss:1.711764	eval-mlogloss:1.958266
[300]	train-mlogloss:1.665693	eval-mlogloss:1.950796
[350]	train-mlogloss:1.625113	eval-mlogloss:1.946868
[400]	train-mlogloss:1.588336	eval-mlogloss:1.944368
[450]	train-mlogloss:1.553331	eval-mlogloss:1.942596
[500]	train-mlogloss:1.520472	eval-mlogloss:1.941656
[550]	train-mlogloss:1.488630	eval-mlogloss:1.940379
Stopping. Best iteration:
[562]	train-mlogloss:1.481375	eval-mlogloss:1.940198
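
CLF_PATH is defined above but not yet used; a minimal sketch (the file name is an assumption) of persisting the trained booster for later reuse:


In [ ]:
#persist the trained booster; the file name is illustrative
pickle.dump(gbm, open('{0}xgb_importance_events.p'.format(CLF_PATH), 'wb'))

#alternatively, xgboost's native binary format
#gbm.save_model('{0}xgb_importance_events.model'.format(CLF_PATH))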

Now, we use the importance scores to identify powerful predictors. The phone brand and device model are of crucial importance; apart from that, the neural network features dominate.


In [17]:
df = pd.Series(gbm.get_fscore(), name = 'fscore').sort_values(ascending = False)

#add true column names
feat_num = [int(col_idx.replace('f','')) for col_idx in df.index]
df.index = names[feat_num]

#set color and draw chart
condlist = [['NN' in str(word) for word in df.index]]
choicelist = [['r'] * len(df.index)]
cols = np.select(condlist, choicelist, default = 'w')

ax = df.iloc[NBAR::-1].plot(kind = 'barh', figsize = (20, 15), color = cols[NBAR::-1])


ax.set_title('Importance Scores')
red_patch = mpatches.Patch(color='r', label = 'Neural network features')
ax.legend(handles=[red_patch], bbox_to_anchor=(1, 0.9))


Out[17]:
<matplotlib.legend.Legend at 0x7f71a22b3208>
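
xgboost also ships a built-in helper for importance plots; it labels the bars with the raw column indices (f0, f1, ...) rather than the mapped feature names, which is why the chart above is built manually, but for a quick look the one-liner below suffices:


In [ ]:
#built-in importance plot (bars are labelled f0, f1, ... instead of feature names)
xgb.plot_importance(gbm)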

The columns corresponding to the most important features are persisted to disk, so that they can be used as input for further classifiers.


In [19]:
#keep only the columns whose importance score exceeds IMPORTANCE_THRESH
relevant_features = sparse.csr_matrix(features)[:, feat_num[: np.where(df == IMPORTANCE_THRESH)[0][0]]]
pickle.dump(relevant_features, open("{0}feature_sel_lvl0_events.p".format(FEATURE_PATH), 'wb'))
relevant_features.shape


Out[19]:
(58503, 2782)
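
Downstream notebooks may also want the names of the selected columns; a small sketch (the file name is an assumption) that stores them next to the feature matrix:


In [ ]:
#names of the selected columns, in the same order as the columns of relevant_features
selected_names = list(df.index[: np.where(df == IMPORTANCE_THRESH)[0][0]])
pickle.dump(selected_names, open("{0}feature_sel_lvl0_events_names.p".format(FEATURE_PATH), 'wb'))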
