In this notebook, we explore the predictive power of the features with XGBoost (xgb). We restrict attention to devices for which event information is available. At the end, we use the importance scores to keep only features of high predictive power.
In [14]:
%matplotlib inline
import matplotlib.patches as mpatches
import matplotlib.pylab as plt
import numpy as np
import operator
import pandas as pd
import pickle
import xgboost as xgb
from scipy import sparse
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost.sklearn import XGBClassifier
#path to data, features and models
DATA_PATH = "../../../input/"
FEATURE_PATH = "../../../features/"
CLF_PATH = "../../../models/"
#only features whose score is larger than IMPORTANCE_THRESH are selected
IMPORTANCE_THRESH = 1
#number of features shown in the importance plot
FEAT_NUM_PLOT = 10
#seed for randomness
SEED = 1747
np.random.seed(SEED)
#number of bars to show in importance plot
NBAR = 30
############################################
###XGB PARAMETERS
############################################
#parameters for xgb fitting
FIT_PARAMS = {
'verbose_eval': 50,
'early_stopping_rounds': 10,
'num_boost_round': 2000,
}
#params for xgb
HYPER_PARAMS = {
"objective": "multi:softprob",
"num_class": 12,
'eval_metric': 'mlogloss',
}
#additional params for xgb; max_depth has to be an integer
HYPER_PARAMS = dict(HYPER_PARAMS, **{
'max_depth': 4, 'colsample_bytree': 0.35, 'subsample': 0.7, 'reg_alpha': 0.34,
'seed': SEED, 'gamma': 2.07e-07, 'n_estimators': 300, 'nthread': 2, 'reg_lambda': 0,
'learning_rate': 0.018, 'min_child_weight': 0.044, 'max_delta_step': 3.0})
We load the group labels and the features generated in previous notebooks.
In [11]:
train = pd.read_csv('{0}gender_age_train.csv'.format(DATA_PATH)).loc[:, ['device_id', 'group']]
test = pd.read_csv('{0}gender_age_test.csv'.format(DATA_PATH)).loc[:, ['device_id']]
train_test = pd.DataFrame(pd.concat([train, test], axis = 0)['device_id'])
pd_features = pickle.load(open('{0}phone_model_features_event.p'.format(FEATURE_PATH), 'rb'))
pd_features_names = pickle.load(open('{0}phone_model_features_names_event.p'.format(FEATURE_PATH), 'rb'))
event_features = pickle.load(open('{0}event_features_noloc.p'.format(FEATURE_PATH), 'rb'))
event_features_names = pickle.load(open('{0}event_features_noloc_names.p'.format(FEATURE_PATH), 'rb'))
bag_of_apps = pickle.load(open('{0}app_features.p'.format(FEATURE_PATH), 'rb'))
bag_of_apps_names = pickle.load(open('{0}app_features_names.p'.format(FEATURE_PATH), 'rb'))
nnet_features = pickle.load(open('{0}nnet_features2.p'.format(FEATURE_PATH), 'rb'))
nnet_features_names = pickle.load(open('{0}nnet_names.p'.format(FEATURE_PATH), 'rb'))
From the training and test data, we select only those devices that have at least one event associated with them.
In [3]:
train_ids = pickle.load(open('{0}train_event_ids'.format(DATA_PATH), 'rb'))
test_ids = pickle.load(open('{0}test_event_ids'.format(DATA_PATH), 'rb'))
train_test_ids = list(train_ids) + list(test_ids)
train_mask = train_test['device_id'].isin(train_test_ids)
train_test_idxs = np.where(train_mask)[0]
train_test_event = train_test.iloc[train_test_idxs, :]
labels = LabelEncoder().fit_transform(train[train_mask.values[:train.shape[0]]]['group'])
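The alignment in this step deserves a second look: the mask is computed on the concatenated train and test ids, so its first `len(train)` entries belong to the training rows. A minimal sketch on made-up device ids follows. All values are hypothetical, and `np.unique` stands in for `LabelEncoder`, which likewise assigns integer codes in sorted class order.

```python
import numpy as np

# Hypothetical toy data: 4 training devices with groups, 2 test devices.
train_ids_all = np.array([10, 11, 12, 13])
train_groups = np.array(["M22-", "F23-", "M22-", "F23-"])
test_ids_all = np.array([20, 21])

# Devices for which at least one event was recorded (assumed values).
event_ids = np.array([10, 11, 21])

# Mask over the concatenated ids, mirroring the cell above.
all_ids = np.concatenate([train_ids_all, test_ids_all])
mask = np.isin(all_ids, event_ids)

# The first len(train) entries of the mask refer to the training rows.
toy_train_mask = mask[: len(train_ids_all)]

# Integer-encode the groups of the selected training devices.
classes, toy_labels = np.unique(train_groups[toy_train_mask], return_inverse=True)

print(mask)
print(classes, toy_labels)
```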
The features extracted in the previous steps are stacked together and a validation set is created.
In [12]:
features = sparse.hstack([pd_features, event_features, nnet_features, bag_of_apps])
###LOAD FEATURES FROM PREVIOUS ITERATIONS
#features = pickle.load(open("{0}feature_sel_lvl0_events.p".format(FEATURE_PATH), 'rb'))
names = np.hstack([pd_features_names, event_features_names, nnet_features_names, bag_of_apps_names])
X_train, X_val, y_train, y_val = train_test_split(sparse.csr_matrix(features)[:len(train_ids), :],
labels, stratify = labels, train_size = 0.8, random_state = SEED)
print(features.shape)
print(names.shape)
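Because the name arrays are concatenated in the same order as the feature blocks, column `j` of the stacked matrix is described by `names[j]`. A small sketch with two made-up blocks illustrates this; the block contents and feature names are hypothetical.

```python
import numpy as np
from scipy import sparse

# Two hypothetical feature blocks for the same 3 devices.
block_a = sparse.csr_matrix(np.array([[1, 0], [0, 2], [3, 0]]))
block_b = sparse.csr_matrix(np.array([[0, 0, 5], [6, 0, 0], [0, 7, 0]]))
names_a = np.array(["brand_x", "brand_y"])
names_b = np.array(["app_1", "app_2", "app_3"])

# hstack returns COO format by default; convert to CSR before slicing.
stacked = sparse.hstack([block_a, block_b]).tocsr()
toy_names = np.hstack([names_a, names_b])

# One name per column.
assert stacked.shape[1] == toy_names.shape[0]

# Column 4 of the stacked matrix is the third column of block_b.
col = 4
print(toy_names[col], stacked[:, col].toarray().ravel())
```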
This training data is used as input for the xgb classifier.
In [15]:
dtrain = xgb.DMatrix(X_train, y_train)
dvalid = xgb.DMatrix(X_val, y_val)
gbm = xgb.train(HYPER_PARAMS, dtrain, evals = [(dtrain, 'train'), (dvalid, 'eval')], **FIT_PARAMS )
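The `mlogloss` values reported during training are multi-class log losses, that is, the mean negative log-probability assigned to the true class. A minimal NumPy sketch of that formula is given below. The helper name `multiclass_log_loss`, the clipping constant and the toy probabilities are illustrative and not taken from xgboost.

```python
import numpy as np

def multiclass_log_loss(probs, true_labels, eps=1e-15):
    """Mean negative log-probability of the true class."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    rows = np.arange(len(true_labels))
    return float(-np.mean(np.log(probs[rows, true_labels])))

# Two samples, three classes (made-up predictions).
toy_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
toy_true = np.array([0, 1])
toy_loss = multiclass_log_loss(toy_probs, toy_true)

# A model that assigns equal probability to all 12 groups scores log(12).
uniform_loss = multiclass_log_loss(np.full((5, 12), 1 / 12), np.zeros(5, dtype=int))

print(round(toy_loss, 4), round(uniform_loss, 4))
```

The uniform baseline of log(12), roughly 2.485, is a convenient reference point when reading the validation scores.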
Now, we use importance scores to identify powerful predictors. The phone brand and device model are of crucial importance. Apart from those, neural network features dominate.
In [17]:
df = pd.Series(gbm.get_fscore(), name = 'fscore').sort_values(ascending = False)
#add true column names
feat_num = [int(col_idx.replace('f','')) for col_idx in df.index]
df.index = names[feat_num]
#set color and draw chart
condlist = [['NN' in str(word) for word in df.index]]
choicelist = [['r'] * len(df.index)]
cols = np.select(condlist, choicelist, default = 'w')
ax = df.iloc[:NBAR][::-1].plot(kind = 'barh', figsize = (20, 15), color = cols[:NBAR][::-1])
ax.set_title('Importance Scores')
red_patch = mpatches.Patch(color='r', label = 'Neural network features')
ax.legend(handles=[red_patch], bbox_to_anchor=(1, 0.9))
Out[17]:
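For a `DMatrix` built without explicit feature names, `get_fscore()` keys the features as `f0`, `f1`, and so on, and only lists features that appear in at least one split. The mapping back to readable names can be sketched on a hand-written dictionary; the scores and names below are made up.

```python
import numpy as np

# Hypothetical importance scores in the format returned by get_fscore().
toy_fscore = {"f3": 12, "f0": 40, "f7": 1}
toy_names = np.array(["brand", "model", "hour", "NN_0", "NN_1", "app_a", "app_b", "app_c"])

# Sort by score, highest first.
ranked = sorted(toy_fscore.items(), key=lambda kv: kv[1], reverse=True)

# Strip the leading 'f' to recover the column index, then look up the name.
col_idx = [int(key[1:]) for key, _ in ranked]
ranked_names = toy_names[col_idx]

print(list(zip(ranked_names.tolist(), [score for _, score in ranked])))
```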
The features whose importance score exceeds IMPORTANCE_THRESH are persisted to disk, so that they can be used as input for further classifiers.
In [19]:
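#keep the columns whose importance score exceeds IMPORTANCE_THRESH
relevant_idxs = np.array(feat_num)[(df > IMPORTANCE_THRESH).values]
relevant_features = sparse.csr_matrix(features)[:, relevant_idxs]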
pickle.dump(relevant_features, open("{0}feature_sel_lvl0_events.p".format(FEATURE_PATH), 'wb'))
relevant_features.shape
Out[19]:
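The selection rule keeps the columns whose importance score exceeds the threshold. On toy values it can be sketched as a boolean comparison; all numbers and the names `toy_scores`, `toy_cols` and `toy_features` are hypothetical.

```python
import numpy as np

THRESH = 1

# Scores sorted in descending order and the column index of each feature.
toy_scores = np.array([40, 12, 3, 1, 1])
toy_cols = np.array([0, 3, 5, 7, 2])

# Toy feature matrix: 3 devices, 8 columns.
toy_features = np.arange(24).reshape(3, 8)

# Keep only the columns whose score is strictly larger than the threshold.
keep = toy_cols[toy_scores > THRESH]
selected = toy_features[:, keep]

print(keep, selected.shape)
```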