In this notebook, we explore the predictive power of the features via XGBoost. We restrict attention to devices for which no event information is available; a sketch of how such a subset can be identified follows the setup cell below.
In [1]:
%matplotlib inline
import matplotlib.patches as mpatches
import matplotlib.pylab as plt
import numpy as np
import operator
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost.sklearn import XGBClassifier
#paths to data and features
DATA_PATH = "../../../input/"
FEATURE_PATH = "../../../features/"
#seed for randomness
SEED = 1747
############################################
###XGB PARAMETERS
############################################
#parameters for xgb fitting
FIT_PARAMS = {
'verbose_eval': 100,
'early_stopping_rounds': 10,
'num_boost_round': 700,
}
#params for xgb
HYPER_PARAMS = {
"objective": "multi:softprob",
"num_class": 12,
'eval_metric': 'mlogloss',
}
#tuned parameters from hyperparameter search; note that this overrides
#the 'multi:softprob' objective above with 'multi:softmax'
HYPER_PARAMS.update({
    'learning_rate': 0.032,
    'max_depth': 6,
    'min_child_weight': 2.144646982692753e-06,
    'gamma': 5.0751583955640074e-08,
    'max_delta_step': 4.5,
    'subsample': 0.35000000000000003,
    'colsample_bytree': 0.6000000000000001,
    'reg_alpha': 1.223491951048497e-10,
    'reg_lambda': 2.9837914959522025e-10,
    'objective': 'multi:softmax',
    'n_estimators': 280, #ignored by xgb.train; num_boost_round is set in FIT_PARAMS
    'nthread': 2,
    'seed': SEED,
})
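The restriction to devices without events is not carried out in this notebook; as referenced above, here is a minimal sketch of how such a subset could be identified. It assumes the standard events.csv from the competition data with a device_id column, which is not read anywhere else in this notebook.
In [ ]:
#hypothetical: identify training devices that never appear in events.csv
events = pd.read_csv('{0}events.csv'.format(DATA_PATH), usecols = ['device_id'])
train_all = pd.read_csv('{0}gender_age_train.csv'.format(DATA_PATH))
no_events = train_all[~train_all['device_id'].isin(events['device_id'].unique())]
print('{0} of {1} training devices have no events'.format(len(no_events), len(train_all)))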
We load the group labels and the features generated in previous notebooks.
In [10]:
train = pd.read_csv('{0}gender_age_train.csv'.format(DATA_PATH))
features = pickle.load(open('{0}phone_model_features.p'.format(FEATURE_PATH), 'rb'))
names = pickle.load(open('{0}phone_model_features_names.p'.format(FEATURE_PATH), 'rb'))
The group labels are encoded, the rows of the feature matrix corresponding to training devices are selected, and a stratified validation set is split off.
In [11]:
labels = LabelEncoder().fit_transform(train['group'])
X_train, X_val, y_train, y_val = train_test_split(features[:train.shape[0], :], labels,
stratify = labels, train_size = 0.8, random_state = SEED)
X_train.shape
Out[11]:
This training data is used as input for the XGBoost classifier.
In [12]:
dtrain = xgb.DMatrix(X_train, y_train)
dvalid = xgb.DMatrix(X_val, y_val)
gbm = xgb.train(HYPER_PARAMS, dtrain, evals = [(dtrain, 'train'), (dvalid, 'eval')], **FIT_PARAMS )
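As a quick sanity check, the held-out set can be scored directly. The sketch below assumes the objective is left at 'multi:softprob', so that predict returns class probabilities; with the tuned 'multi:softmax' override above, predict returns hard class labels instead and log_loss would not apply.
In [ ]:
from sklearn.metrics import log_loss
#class probabilities on the held-out set (requires objective 'multi:softprob')
probs = gbm.predict(dvalid)
print('validation mlogloss: {0:.5f}'.format(log_loss(y_val, probs)))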
Now, we use importance scores to identify powerful predictors. We see that the device model has substantially higher predictive power than the other features.
In [9]:
df = pd.Series(gbm.get_fscore(), name = 'fscore').sort_values(ascending = False)
#add true column names
feat_num = [int(col_idx.replace('f','')) for col_idx in df.index]
df.index = names[feat_num]
#set color and draw chart
cols = np.where(['device_model' in word for word in df.index], 'r', 'w')
ax = df[::-1].plot(kind = 'barh', figsize = (20, 15), color = cols[::-1])
ax.set_title('Importance Scores')
red_patch = mpatches.Patch(color='r', label='device-model')
blue_patch = mpatches.Patch(color='w', label='phone-brand/device-pref')
ax.legend(handles=[red_patch, blue_patch], bbox_to_anchor=(1, 0.9))
Out[9]:
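The visual impression can be quantified by aggregating the scores per feature family; a small sketch reusing the relabelled importance series from above:
In [ ]:
#sum of importance scores per feature family
families = np.where(['device_model' in word for word in df.index],
                    'device_model', 'phone_brand/device_pref')
print(df.groupby(families).sum())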