Artem Zaika: KDD Cup 2009 - Customer relationship prediction
For the study data were taken from the competition KDD 2009 on Customer Relationship prediction
The task is to estimate the churn of customers, hence there are two target values to be predicted. The challenge is staged in phases to test the rapidity with which each team is able to produce results. A large number of variables (15,000) is made available for prediction. However, to engage participants having access to less computing power, a smaller version of the dataset with only 230 variables will be made available in the second part of the challenge.
Churn (wikipedia definition): Churn rate is also sometimes called attrition rate. It is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks.
The small dataset will be made available at the end of the fast challenge. Both training and test sets contain 50,000 examples. The data are split similarly for the small and large versions, but the samples are ordered differently within the training and within the test sets. Both small and large datasets have numerical and categorical variables. For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical. For the small dataset, the first 190 variables are numerical and the last 40 are categorical. Toy target values are available only for practice purpose. The prediction of the toy target values will not be part of the final evaluation.
The performances are evaluated according to the arithmetic mean of the AUC for the hurn. This is what we call "Score".
My work will be divided into several parts:
In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import xgboost as xgb
from statsmodels.stats.weightstats import *
from statsmodels.stats.proportion import proportion_confint
from scipy import stats
from collections import defaultdict, Counter
import itertools
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
from sklearn.feature_selection import SelectKBest,RFECV, SelectFromModel
from sklearn.feature_selection import mutual_info_classif, f_classif
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import validation_curve, learning_curve
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit, StratifiedKFold
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
seed = 123
np.random.seed(seed)
In [2]:
xgb.__version__
Out[2]:
Сonduct primary data analysis. Look at the data validity and omissions
In [3]:
df = pd.read_csv('orange_small_churn_train_data.csv', index_col='ID')
df.head()
Out[3]:
For convenience, replace the target variable with the label -1 (churn) on 0
In [4]:
df.loc[df['labels'] == -1, 'labels'] = 0
As we see in the data a lot of missing values. There are also features, which have missed all the values. When recovery permits, it is important to have a reliable allocation, but if the blanks in features of more than 60%, then I think such a distribution can not be trusted and means to restore such signs are useless, because they introduce only noise in the model
Remove features of having gaps in more than 65% of cases
In [5]:
def drop_empty(df):
has_nan = df.isnull().sum()
empty_cols = has_nan[has_nan >= len(df) * 0.65]
to_drop = empty_cols.index.values.tolist()
df.drop(to_drop, axis=1, inplace = True)
return df
df = drop_empty(df)
df.head()
Out[5]:
In [6]:
data, labels = df.iloc[:, :-1], df.iloc[:,-1]
Calculate proporions of classes “churn” and “not churn”.
In [29]:
length = data.shape[0]
label_counts = labels.value_counts()
print 'Proporion of class “not churn”:', float(label_counts[0]) / length
print 'Proporion of class “churn”:', float(label_counts[1]) / length
As we see, proportion of classes 1 is more than proportion of class 0, in almost 130 times. It's not good as for classification it is important that classes in the sample were balanced. Ie in our case the number of labels class 1 - 'not churn', the proportion must be equal to the number of tags of class 0 - 'churn'. If the balance is skewed toward larger class, models, especially linear, can be too retrained to detect class with higher proportion and ignore the small one.
Later on we will continue to conduct experiments to increase number of objects of smaller class and see how it will affect the model evaluation.
There are two types of features in our sample- real and categorical. Separate them and, by default, fill in the blanks in the numerical features with zeros and categorical features with string - 'NaN'
In [8]:
num_features = data.select_dtypes(include=['float64', 'int64']).columns
cat_features = data.select_dtypes(include=['object']).columns
data[num_features] = data[num_features].fillna(0)
data[cat_features] = data[cat_features].fillna('NaN')
Evaluate the density distribution of the classes. This will allow first, to assess the type of distribution(this can be useful when restoring blanks), and the ratio of distributions of features for different classes.
In [9]:
fig, axes = plt.subplots(11, 4, figsize=(20, 30))
for i in xrange(len(num_features)):
sns.kdeplot(data.loc[labels == 0, num_features[i]], color = 'r', ax=axes[i / 4, i % 4], label=0)
sns.kdeplot(data.loc[labels == 1, num_features[i]], color = 'b', ax=axes[i / 4, i % 4], label=1)
axes[i / 4, i % 4].set(xlabel=num_features[i])
fig.tight_layout()
So, what can we say about the densities of the distributions? They do not belong to a normal distribution and some distribution, for example Var189, Var73, Var126 clearly suggests that the distribution of classes 0 and 1 for these features are different, so could correlate differently with the target label
Why I started talking about the correlation between the features and target label. The thing is that in the calculation of dependencies of target labels from features, it is highly desirable to do so, so the features are poorly correlated among themselves, but were highly correlated with the target label. If features are not highly correlated with the target label, it is indicative of their usefulness in prediction. If the features correlate strongly with each other, it suggests redundancy and can negative affect the quality of prediction. Thus, highly correlated features can be simplified or somehow express each other and it can have a positive effect in forecasting.
Because the target variable is binary (classes do not churn/churn), will not fit the calculation of the Pearson correlation (or any other), so calculate the difference of mathematical expectation of feature for both classes of the target variable. Distribution of numerical features can be very different (see below), so as a measure of relatedness between the target variable and the feature will be used the significance level of the hypothesis that the characteristic distribution for both classes (the target variable) are the same. Using the Mann-Whitney test get that the mathematical expectation of 24 features for both classes are different at significance level 0.01:
In [10]:
# check whether average is significantly different for both classes in each feature
pvals = []
for i, col_name in enumerate(num_features):
class0 = data.loc[labels == 0, col_name]
class1 = data.loc[labels == 1, col_name]
pvals.append(stats.mannwhitneyu(class0, class1)[1])
pvals_df = pd.DataFrame({'p_value': pvals}, index=num_features).sort_values('p_value', ascending=True)
# the number of variables that have statistically significantly different average
sum(pvals_df['p_value'] < 0.01)
Out[10]:
In [91]:
top20 = pvals_df[:20].index.tolist()
pvals_df.head()
Out[91]:
Graphs of the density distributions of the first 8 most different features(the weakest correlated) depending on the target variable:
In [92]:
fig, axes = plt.subplots(2, 4, figsize=(16, 6))
for i in range(8):
sns.kdeplot(data.loc[labels == 0, pvals_df.index[i]], color = 'r', ax=axes[i / 4, i % 4], label=0)
sns.kdeplot(data.loc[labels == 1, pvals_df.index[i]], color = 'b', ax=axes[i / 4, i % 4], label=1)
axes[i / 4, i % 4].set(xlabel=pvals_df.index[i])
fig.tight_layout()
Numeric features have a lot of blanks, so the lack of values in a numeric variable can also be a good marker-feature for the model. For example, one of the previous feature, Var126, we compare the proportion of positive class for the whole sample and among the missing values of the feature Var126:
In [27]:
print "Proportion of class churn", float(label_counts[1]) / length
print "Proportion of class churn for missed values of feature Var126:", np.round(labels[df.Var126.isnull()].mean(), 3)
To be sure that the proportions differ significantly, considering a confidence interval of Wilson for the fraction of missing values of the characteristic Var126:
In [28]:
proportion = labels[df.Var126.isnull()].mean()
n = sum(df.Var126.isnull())
print "Confidence interval:", np.round(proportion_confint(proportion*n, n, alpha = 0.05, method="wilson"), 4)
Categorical variables are very different, some of them has 10 to >13 thousand unique values. To check which of the variables can be useful for us let's calculate for each unique value of a categorical feature the proportion of positive class, and construct a confidence interval for this proportion. If the proportion of the original dataset is missing in the interval, then this feature (with these unique values) may be useful for the model.
In [95]:
# the Cramer's correlation coefficient for categorical variables
# http://stackoverflow.com/questions/20892799/using-pandas-calculate-cramérs-coefficient-matrix
def cramers_corrected_stat(confusion_matrix):
""" calculate Cramers V statistic for categorial-categorial association.
uses correction from Bergsma and Wicher,
Journal of the Korean Statistical Society 42 (2013): 323-328
"""
chi2 = stats.chi2_contingency(confusion_matrix)[0]
n = confusion_matrix.sum()
phi2 = chi2/n
r,k = confusion_matrix.shape
phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
rcorr = r - ((r-1)**2)/(n-1)
kcorr = k - ((k-1)**2)/(n-1)
return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
# wrapper for the function above
def get_cramers_cor(var1, var2):
if var1.equals(var2):
return 1.0
confusion_matrix = pd.crosstab(var1, var2)
return cramers_corrected_stat(np.array(confusion_matrix))
# build Wilson's confidence interval and check whether basline belongs to it
def in_conf_int(baseline, proportion, n, alpha = 0.05):
conf_int = proportion_confint(proportion*n, n, alpha = alpha, method="wilson")
if baseline > conf_int[0] and baseline < conf_int[1]:
return True
return False
In [96]:
# calculate correlations
conf_matrix_cor = pd.DataFrame(1.0, index=cat_features, columns=cat_features)
for feat in itertools.combinations(cat_features, 2):
conf_matrix_cor.loc[feat[0], feat[1]] = get_cramers_cor(data[feat[0]], data[feat[1]])
conf_matrix_cor.loc[feat[1], feat[0]] = conf_matrix_cor.loc[feat[0], feat[1]]
# triangle mask
mask = np.zeros_like(conf_matrix_cor, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# plot params
f, ax = plt.subplots(figsize=(12, 12))
# palette
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# heat-map drawing
sns.heatmap(conf_matrix_cor, mask=mask, cmap=cmap, vmax=1.0, vmin=0.0,
square=True, xticklabels=True, yticklabels=True,
linewidths=.1, cbar_kws={"shrink": .5}, ax=ax)
plt.show()
Conclusions: So, we conducted primary data analysis and identified features and categorially features that can have the greatest impact on prognosis. Such analysis can be very useful when working with linear models, but when working with Trees, RandomForest and Gradient Boosting models, it may not be very useful, because these models in the preview are not sensitive to distortions in classes, and secondly use a different approach in assessing the importance of features(the so-called 'information gain').
Hereafter, I will use the model gradient boosting - XGBoost. It will have some influence on the data, such as:
Make some utility functions for visualization of cross-validation results of XGBoost.
In [32]:
# Graph results sklearn.model_selection.learning_curve
def plot_curves(train_scores, test_scores, train_sizes, x_text="num of samples"):
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
fig = plt.figure(figsize=(10, 6), dpi=100)
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.xlabel(x_text)
plt.ylabel("score")
plt.legend(loc="best")
plt.show()
i = np.argmax(test_scores_mean)
print("Best cross-validation result ({0:.2f}) obtained for {1} ".format(test_scores_mean[i], train_sizes[i]))
# visualize score depending on number of objects
def train_learning_curve(X_train, y_train, default_params=None):
scale_pos_weight = float(sum(y_train == 0))/sum(y_train == 1)
if default_params is None:
default_params = {
'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': 1.0,
'scale_pos_weight':scale_pos_weight,
'n_estimators':23,
'seed': seed
}
default_params['scale_pos_weight'] = scale_pos_weight
train_sizes, train_scores, valid_scores = learning_curve(
xgb.XGBClassifier(**default_params),
X_train, y_train,
scoring='roc_auc',
cv=StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=seed), n_jobs=-1)
plot_curves(train_scores, valid_scores, train_sizes)
# Score - number of algorithms dependency visualizaion
def train_validation_curve(X_train, y_train, est_end=200, default_params=None):
scale_pos_weight = float(sum(y_train == 0))/sum(y_train == 1)
if default_params is None:
default_params = {
'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': 1.0,
'scale_pos_weight':scale_pos_weight,
'seed': seed
}
default_params['scale_pos_weight'] = scale_pos_weight
n_estimators_range = np.linspace(1, est_end, 10).astype('int')
train_scores, test_scores = validation_curve(
xgb.XGBClassifier(**default_params),
X_train, y_train,
param_name = 'n_estimators',
param_range = n_estimators_range,
cv=5,
scoring='roc_auc', n_jobs=-1)
plot_curves(train_scores, test_scores, n_estimators_range, x_text="num of trees")
# Cross-validation
def cross_validation(train, labels, default_params=None, num_rounds=23): # how many estimators
scale_pos_weight = float(sum(labels == 0))/sum(labels == 1)
if default_params is None:
default_params = {'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'seed': seed}
default_params['scale_pos_weight'] = scale_pos_weight
dtrain = xgb.DMatrix(data=train, label=labels, silent=False, feature_names=train.columns)
cv_results = xgb.cv(default_params,
dtrain,
num_boost_round=num_rounds,
nfold=5,
metrics=['auc'],
# shuffle=True,
# stratified=True,
early_stopping_rounds=5,
seed=seed)
return cv_results
It's time to create a delayed sample. It will be a wet towel on your face, when success hit the head due to the fact that you(or rather I) will obtain estimates for roc_auc=1.0.
There is one more thing that we need to consider. Models in Machine Learning can only work with numeric values, but not with text data. So we need to convert categorical features to numerical. The basic approach - we will encode all the blanks in the physical signs to 0, in the categorical - 'NaN' and range them using LabelEncoder().
In [30]:
data[cat_features] = data[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))
X_train, X_hold_out, y_train, y_hold_out = train_test_split(data, labels, test_size=0.25,
random_state=seed, stratify=labels)
print 'Any missing values? ', X_hold_out.isnull().values.any()
In [31]:
X_train.head()
Out[31]:
The data is normal view now, let's first find the desired number of algorithms in xgboost. This is necessary in order not to waste time building unnecessary algorithms and thus to speed up the model training. Put the maximum range is 100. Leave other parameters by default.
In [123]:
default_params = {
'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': 1.0,
'seed': seed
}
train_validation_curve(X_train, y_train, est_end=100, default_params=default_params)
As we can see, the highest rating obtained in the number of trees 23. The following tests will use the rounded number of algorithms to 24
In [37]:
default_params = {'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': 1.0,
'n_estimators': 24,
'seed': seed }
How to interpret the behavior of the curves in the graphs? Comes to the rescue of the information from the sklearn package: http://scikit-learn.org/stable/modules/learning_curve.html#learning-curve
Description of validation-curve
If the training score and the validation score are both low, the estimator will be underfitting. If the training score is high and the validation score is low, the estimator is overfitting and otherwise it is working very well. A low training score and a high validation score is usually not possible
Description of training-curve
A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. If both the validation score and the training score converge to a value that is too low with increasing size of the training set, we will not benefit much from more training data. In the following plot you can see an example: naive Bayes roughly converges to a low score.
We will probably have to use an estimator or a parametrization of the current estimator that can learn more complex concepts (i.e. has a lower bias). If the training score is much greater than the validation score for the maximum number of training samples, adding more training samples will most likely increase generalization.
To sum up, my goal of visual analysis is to make sure that the curves were on the same level, but did not go far up, because it is a signal about retraining
Let's start simple. Let's estimate how many objects do we need to build quality models. Training is available for a fairly large sample and it may be that at some point the growth of the size of the training sample ceases to affect the quality of the model. Construct learning curves, training the model on samples of different sizes starting with a small number of objects in the training set and gradually increasing its size incrementally.
In [127]:
_X_train, _, _y_train, __ = train_test_split(X_train, y_train, stratify=y_train, train_size=0.3, random_state=seed)
train_learning_curve(_X_train, _y_train, default_params=default_params)
Compare with results of cross-validation. The bottom line is different, but the results are quite acceptable
In [128]:
cv_results = cross_validation(_X_train, _y_train, num_rounds=23)
cv_results.tail(5)
Out[128]:
In [129]:
_X_train, _, _y_train, __ = train_test_split(X_train, y_train, stratify=y_train, train_size=0.5, random_state=seed)
train_learning_curve(_X_train, _y_train, default_params=default_params)
Compare with results of cross-validation.
In [131]:
cv_results = cross_validation(_X_train, _y_train, num_rounds=23)
cv_results.tail(5)
Out[131]:
In [132]:
_X_train, _, _y_train, __ = train_test_split(X_train, y_train, stratify=y_train, train_size=0.8, random_state=seed)
train_learning_curve(_X_train, _y_train, default_params=default_params)
Compare with results of cross-validation.
In [133]:
cv_results = cross_validation(_X_train, _y_train, num_rounds=23)
cv_results.tail(5)
Out[133]:
In [134]:
train_learning_curve(X_train, y_train, default_params=default_params)
print y_train.value_counts()
Compare with results of cross-validation.
In [135]:
cv_results = cross_validation(X_train, y_train, num_rounds=23)
cv_results.tail(5)
Out[135]:
So, if we increase the sample assessment roc_auc is growing. Arrange this assessment the test of cruel reality - check it on hold-out sample
In [36]:
params = {'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': 1,
'n_estimators': 23,
'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
'seed': seed }
gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train, y_train, eval_metric='auc')
print 'Hold-out set, roc-auc: ', roc_auc_score(y_hold_out, gtree.predict_proba(X_hold_out)[:, 1])
Often unbalanced classes samples lead to some problems while training the models. For instance model can fit perfectly the samples of biggest class and show good accuracy score, but recall score will be very poor.
Let's try to differentiate the sample, play with the distribution of objects into classes and conclude about how the class ratio affects the quality of the model.
In [34]:
def oversample_x(data, labels, labels_x, amount_percents=1.0):
rnd_indices = np.random.randint(0, len(labels_x), len(labels_x)*amount_percents)
new_df = pd.concat([data, data.ix[labels_x[rnd_indices]]], ignore_index=True)
new_lb = pd.concat([labels, labels.ix[labels_x[rnd_indices]]], ignore_index=True)
new_df = pd.concat([new_df, new_lb], axis=1)
# Shuffle
new_df = pd.concat([new_df[:1], new_df[1:].sample(frac=1)]).reset_index(drop=True)
return new_df.iloc[:, :-1], new_df.iloc[:,-1]
def oversample_1(data, labels, amount_percents):
label_idx = labels[labels == 1].index.values
return oversample_x(data, labels, label_idx, amount_percents)
def oversample_0(data, labels, amount_percents):
label_idx = labels[labels == 0].index.values
return oversample_x(data, labels, label_idx, amount_percents)
In [38]:
train_data_3, train_labels_3 = oversample_1(X_train, y_train, 3.0)
print 'Classes proportion\n', train_labels_3.value_counts()
train_learning_curve(train_data_3, train_labels_3, default_params=default_params)
Compare with results of cross-validation.
In [139]:
cv_results = cross_validation(train_data_3, train_labels_3, num_rounds=23)
cv_results.tail(5)
Out[139]:
In [39]:
train_data_6, train_labels_6 = oversample_1(X_train, y_train, 6.0)
print 'Classes proportion\n', train_labels_6.value_counts()
train_learning_curve(train_data_6, train_labels_6, default_params=default_params)
Compare with results of cross-validation.
In [40]:
cv_results = cross_validation(train_data_6, train_labels_6, num_rounds=23)
cv_results.tail(5)
Out[40]:
In [41]:
train_data_11, train_labels_11 = oversample_1(X_train, y_train, 11.2)
print u'Classes proportion\n', train_labels_11.value_counts()
train_learning_curve(train_data_11, train_labels_11, default_params=default_params)
Compare with results of cross-validation.
In [42]:
cv_results = cross_validation(train_data_11, train_labels_11, num_rounds=23)
cv_results.tail(5)
Out[42]:
Compare with results on hold-out dataset.
In [54]:
params = {'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': 1,
'n_estimators': 23,
'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
'seed': 123 }
train_data_11, train_labels_11 = oversample_1(X_train, y_train, 11.0)
gtree = xgb.XGBClassifier(**params)
gtree.fit(train_data_11, train_labels_11, eval_metric='auc')
print 'Hold-out set, roc-auc: ', roc_auc_score(y_hold_out, gtree.predict_proba(X_hold_out)[:, 1])
As can be seen from the graphs of the cross-validation and cross-validation, the more data and the more balanced the sample is, the higher the score roc_auc, but the score on the hold-out sample is not so optimistic and makes it clear that increasing the number of the same data of a smaller class may lead to overfitting. Why is this happening? Maybe it's the fact that I have enable the option of balancing classes in XGBoost and because of this, the saturation of the sample the same data leads to overfitting. Switching on and off 'scale_pos_weight' parameter of XGBoost confirms that assumption.
Apply undersampling technique to sampling: for this you need to remove number of objects of larger class from the sample so that the classes proportion has changed.
Looking ahead I will say that after a little research I found out that the model is most mistaken for the objects, which had a lot of blanks. Moreover, such objects are among the class of "0". I think it will be convenient to reduce the number of objects larger class where there is likelyhood to make a mistake when you restore blanks.
In [55]:
def get_fresh_data():
fresh_data = pd.read_csv('orange_small_churn_train_data.csv', index_col='ID')
fresh_data.loc[fresh_data['labels'] == -1, 'labels'] = 0
return drop_empty(fresh_data)
def standart_fill_nans(data):
df_new = data.copy(deep=True)
num_features = df_new.select_dtypes(include=['float64', 'int64']).columns
cat_features = df_new.select_dtypes(include=['object']).columns
df_new[num_features] = df_new[num_features].fillna(0)
df_new[cat_features] = df_new[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))
return df_new
In [56]:
def undersample_0(amount_nans=50):
new_df = get_fresh_data()
train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]
labels_0_idx = train_labels[train_labels == 0].index
to_drop_idxs = []
for row in train_data.ix[labels_0_idx].iterrows():
if row[1].isnull().sum() > amount_nans:
to_drop_idxs.append(row[0])
train_data, train_labels = train_data.drop(train_data.index[to_drop_idxs]), train_labels.drop(train_labels.index[to_drop_idxs])
return train_data, train_labels
In [57]:
train_data_und_40, train_labels_und_40 = undersample_0(40)
train_data_und_40 = standart_fill_nans(train_data_und_40)
print u'Classes proportion\n', train_labels_und_40.value_counts()
train_learning_curve(train_data_und_40, train_labels_und_40, default_params=default_params)
Compare with results of cross-validation.
In [58]:
cv_results = cross_validation(train_data_und_40, train_labels_und_40, num_rounds=23)
cv_results.tail(5)
Out[58]:
In [59]:
train_data_und_10, train_labels_und_10 = undersample_0(10)
train_data_und_10 = standart_fill_nans(train_data_und_10)
print u'Classes proportion\n', train_labels_und_10.value_counts()
train_learning_curve(train_data_und_10, train_labels_und_10, default_params=default_params)
Compare with results of cross-validation.
In [60]:
cv_results = cross_validation(train_data_und_10, train_labels_und_10, num_rounds=23)
cv_results.tail(5)
Out[60]:
Compare with results on hold-out dataset.
In [169]:
params = {'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': 1,
'n_estimators': 23,
'seed': seed }
train_data_ho, train_labels_ho = undersample_0(10)
train_data_ho = standart_fill_nans(train_data_ho)
gtree = xgb.XGBClassifier(**params)
gtree.fit(train_data_ho, train_labels_ho, eval_metric='auc')
print 'Hold-out set, roc-auc: ', roc_auc_score(y_hold_out, gtree.predict_proba(X_hold_out)[:, 1])
In [170]:
features_sample = ['Var6', 'Var7', 'Var149', 'Var109']
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for i in xrange(4):
sns.kdeplot(X_train.loc[y_train == 0, features_sample[i]], color = 'r', ax=axes[i], label=0)
sns.kdeplot(X_train.loc[y_train == 1, features_sample[i]], color = 'b', ax=axes[i], label=1)
axes[i].set(xlabel=features_sample[i])
fig.tight_layout()
So, the graphs of the density distribution for randomly chosen features can be seen that even for balanced sampling density can vary. Let us restore the data separately by classes
Fill-in the numerical features blanks by average values
In [174]:
new_df = get_fresh_data()
train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]
data_label_0 = train_data[train_labels==0]
data_label_1 = train_data[train_labels==1]
train_data[train_labels==0] = data_label_0[num_features].fillna(data_label_0[num_features].mean())
train_data[train_labels==1] = data_label_1[num_features].fillna(data_label_1[num_features].mean())
train_data[cat_features] = train_data[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))
print 'Any values missing? ', train_data.isnull().values.any()
print 'Classes proportion\n', train_labels.value_counts()
In [175]:
cv_results = cross_validation(train_data, train_labels, num_rounds=23)
cv_results.tail(5)
Out[175]:
Scale out numerical features. On one hand, scaling is very useful when working with linear models, and when working with trees may not help, but let's look at the result
Scale-out by RobustScaler, which can help if values in features have large emissions.
In [178]:
new_df = get_fresh_data()
train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]
data_label_0 = train_data[train_labels==0]
data_label_1 = train_data[train_labels==1]
train_data[train_labels==0] = data_label_0[num_features].fillna(0).apply(RobustScaler().fit_transform)
train_data[train_labels==1] = data_label_1[num_features].fillna(0).apply(RobustScaler().fit_transform)
train_data[cat_features] = train_data[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))
print 'Any values missing? ', train_data.isnull().values.any()
print 'Classes proportion\n', train_labels.value_counts()
Cross-validation check.
In [179]:
cv_results = cross_validation(train_data, train_labels, num_rounds=23)
cv_results.tail(5)
Out[179]:
Scale-out by StandardScaler, transforming data to a standard distribution
In [181]:
new_df = get_fresh_data()
train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]
data_label_0 = train_data[train_labels==0]
data_label_1 = train_data[train_labels==1]
train_data[train_labels==0] = data_label_0[num_features].fillna(0).apply(StandardScaler().fit_transform)
train_data[train_labels==1] = data_label_1[num_features].fillna(0).apply(StandardScaler().fit_transform)
train_data[cat_features] = train_data[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))
print 'Any values missing? ', train_data.isnull().values.any()
print 'Classes proportion\n', train_labels.value_counts()
Cross-validation check.
In [182]:
cv_results = cross_validation(train_data, train_labels, num_rounds=23)
cv_results.tail(5)
Out[182]:
As we can see, each of the tested methods may lead to overfitting of the model. We assume that the recovery by class not working. Confirm this fact with testing on hold-out sample
In [209]:
new_df = get_fresh_data()
data, labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]
num_features = data.select_dtypes(include=['float64', 'int64']).columns
cat_features = data.select_dtypes(include=['object']).columns
data[cat_features] = data[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))
X_train, X_hold_out, y_train, y_hold_out = train_test_split(data, labels, test_size=0.25,
random_state=seed, stratify=labels)
data_label_0 = X_train.loc[y_train==0, num_features]
data_label_1 = X_train.loc[y_train==1, num_features]
data_label_0 = data_label_0.fillna(data_label_0.mean())
data_label_1 = data_label_1.fillna(data_label_1.mean())
X_hold_out[num_features] = X_hold_out[num_features].fillna(X_hold_out[num_features].mean())
print 'Any values missing? ', X_hold_out.isnull().values.any()
params = {'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': 1,
'n_estimators': 23,
'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
'seed': seed }
gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train, y_train, eval_metric='auc')
print 'Hold-out set, roc-auc: ', roc_auc_score(y_hold_out, gtree.predict_proba(X_hold_out)[:, 1])
Testing has shown that it is best to fill in the blanks with zeros. This is not surprising because if you look at the mode of our data, it is easy to see that the majority of numerical features the number with the highest frequency is number zero. However, the recovery features's blanks of the individual classes was not justified.
Scaling of numerical features has had little influence on the result.
Before experiments with categorical features, let's look at their density distribution. This will help us to form hypotheses how to work with them and maybe it will become clear what categorical features better get rid of
In [225]:
fig, axes = plt.subplots(8, 4, figsize=(20, 30))
for i in xrange(len(cat_features)):
sns.distplot(data.loc[labels == 0, cat_features[i]], bins=30, kde=False, ax=axes[i / 4, i % 4], label='0')
sns.distplot(data.loc[labels == 1, cat_features[i]], bins=30, kde=False, ax=axes[i / 4, i % 4], label='1')
axes[i / 4, i % 4].set(xlabel=cat_features[i])
fig.tight_layout()
Estimate values frequencies in categorical features.
In [232]:
df[cat_features].describe()
Out[232]:
As we can see the number of unique values in some categories more than a few thousand: Var198, Var200, Var220, Var222 and others. Such number of unique values is not very helpfull in calculating the class to which the object belongs to and looks more like noise to be rid of. Remove features with number of categories greater than 30.
Code categorially features by using binarization, we introduce a separate feature-marker for empty values
In [15]:
def get_cat_bin_data():
new_df = get_fresh_data()
train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]
num_features = train_data.select_dtypes(include=['float64', 'int64']).columns
#feature-marker for empty values
num_features_na = list(map(lambda name: name+'_na', num_features))
train_data[num_features_na] = train_data[num_features].apply(lambda x: pd.isnull(x).astype(int))
train_data[num_features] = train_data[num_features].fillna(0)
cat_features = train_data.select_dtypes(include=['object']).columns
to_drop = []
for cat_col in cat_features:
uniques = len(train_data[cat_col].unique())
if uniques > 30:
to_drop.append(cat_col)
train_data = train_data.drop(to_drop, axis=1)
cat_features = [col for col in cat_features if col not in to_drop]
# Binarizing of cat.+ na
dummies_cat = pd.get_dummies(train_data[cat_features], dummy_na=True)
train_data = train_data.drop(cat_features, axis=1)
train_data = pd.concat([train_data, dummies_cat], ignore_index=False, axis=1)
return train_data, train_labels
In [64]:
train_data, train_labels = get_cat_bin_data()
print 'New shape:', train_data.shape
X_train, X_valid, y_train, y_valid = train_test_split(train_data, train_labels, test_size=0.25,
random_state=seed, stratify=train_labels)
print 'Any missing values? ', X_train.isnull().values.any()
print 'Classes proportion\n', y_train.value_counts()
cv_results = cross_validation(X_train, y_train, num_rounds=23)
cv_results.tail(5)
Out[64]:
The results on the hold-out sample with dummy coding quite a bit has fallen in comparison with LabelEncoder, but this result still does not figure. Things can change after selection of the parameters and setting the model, because now all the experiments going on basline
In [65]:
params = {'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': 1,
'n_estimators': 23,
'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
'seed': seed }
gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train, y_train, eval_metric='auc')
print 'Hold-out set, roc-auc: ', roc_auc_score(y_valid, gtree.predict_proba(X_valid)[:, 1])
All features have been useful to build models? Conduct the procedure of selecting features, try different options of selection.
For features selecting I will use several approaches. The first is already built into XGBoost is the ability to look at the importance of the features by the number of times when he was involved in the split. Second, is the selection K-best of features by Mutual Information or the Fisher test. The third is a greedy selection of features based on the model LogisticRegression.
Selection by XGBoost
In [66]:
params = {'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': 1,
'n_estimators': 23,
'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
'eval_metric':'auc',
'seed': seed }
dtrain = xgb.DMatrix(data=X_train, label=y_train, silent=False, feature_names=X_train.columns)
dtest = xgb.DMatrix(data=X_valid, label=y_valid, silent=False, feature_names=X_valid.columns)
watchlist = [(dtest,'test'), (dtrain,'train')]
# bst = xgb.train(params, dtrain, 40, watchlist)
bst = xgb.train(params, dtrain, 40)
importances = bst.get_fscore()
importance_df = pd.DataFrame({
'Splits': list(importances.values()),
'Feature': list(importances.keys())
})
importance_df.sort_values(by='Splits', inplace=True)
importance_df.plot(kind='barh', x='Feature', figsize=(8,10), color='orange')
print "\nOptimal number of features : %d" % len(importances.values())
Memorize the most important signs, that will come in handy in the future
In [67]:
k_v = zip(importances.values(), importances.keys())
k_v.sort(key = lambda t: t[0], reverse=True)
kv_importances = [v[1] for v in k_v]
kv_importances
Out[67]:
In [20]:
params = {'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': 1,
'n_estimators': 23,
'seed': seed }
gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train[kv_importances], y_train, eval_metric='auc')
print 'Hold-out set, roc-auc: ', roc_auc_score(y_valid, gtree.predict_proba(X_valid[kv_importances])[:, 1])
K-best selection of features by Mutual Information or the Fisher test
In [489]:
params = {
'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': True,
# 'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
'n_estimators':23,
'seed': seed
}
pipeline = Pipeline([
('selector', SelectKBest()),
('boost', xgb.XGBClassifier(**params))
])
parameters_grid = {
'selector__score_func': [f_classif, mutual_info_classif],
'selector__k': [90, 100, 120]
}
grid_cv = GridSearchCV(pipeline, parameters_grid, scoring='roc_auc', n_jobs=-1,
cv = StratifiedShuffleSplit(n_splits=5,test_size=0.25,random_state=seed))
grid_cv.fit(X_train, y_train)
print u'Best AUC score: ', grid_cv.best_score_
print u'Best params: ',grid_cv.best_params_
Hold-out sample estimation
In [23]:
params = {'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': True,
'n_estimators': 23,
'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
'seed': seed }
kbest = SelectKBest(k=90, score_func=f_classif)
X_train_kbest = kbest.fit_transform(X_train, y_train)
X_valid_kbest = kbest.transform(X_valid)
gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train_kbest, y_train, eval_metric='auc')
print 'Hold-out set, roc-auc: ', roc_auc_score(y_valid, gtree.predict_proba(X_valid_kbest)[:, 1])
Finally, we will conduct a greedy selection of features with help of LogisticRegression
In [70]:
from sklearn.linear_model import LogisticRegression
X_train_scaled = X_train
X_train_scaled[num_features] = RobustScaler().fit_transform(X_train[num_features])
rfecv = RFECV(estimator=LogisticRegression(class_weight='balanced'),
step=1,
cv=StratifiedShuffleSplit(n_splits=5,test_size=0.3,random_state=seed),
scoring='roc_auc',
n_jobs=-1)
rfecv.fit(X_train_scaled, y_train)
# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
print("Optimal number of features : %d" % rfecv.n_features_)
In [71]:
selected_idxes = np.where( rfecv.support_ == True )
rfe_selected = X_train.columns.values[selected_idxes]
print rfe_selected
In [74]:
params = {'objective': 'binary:logistic',
'max_depth': 2,
'learning_rate': 0.3,
'silent': True,
'n_estimators': 23,
'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
'seed': seed }
gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train[rfe_selected], y_train, eval_metric='auc')
print 'Hold-out set, roc-auc: ', roc_auc_score(y_valid, gtree.predict_proba(X_valid[rfe_selected])[:, 1])
Conclusions: So, different approaches have selected different number of features and showed different results on the hold-out sample. Greedy selection, 169, auc: 0.5685, selection by splits in the XGBoost - 40, auc: 0.7219. SelectKBest using Fisher 90, auc: 0.7247.
Select the optimal model parameters. Note that depending on how you processed the raw data, did the class balancing, how many objects are left in the training set, the optimal values of the parameters can be changed. Take the best of decisions at the moment, and will carry out the procedure of selection of parameters of the model using hyperopt.
In [26]:
from hyperopt import hp
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
def score(params):
xgb_tree = xgb.XGBClassifier(**params)
eval_set = [(X_train, y_train), (X_valid, y_valid)]
xgb_tree.fit(X_train, y_train,
eval_set=eval_set, eval_metric="auc",
early_stopping_rounds=30, verbose=False)
pred = xgb_tree.predict_proba(X_valid)[:,1]
auc = roc_auc_score(y_valid, pred)
return{'loss':1-auc, 'status': STATUS_OK }
def optimize(trials):
space = {
'n_estimators' : hp.choice('n_estimators', np.arange(10, 100, dtype=int)),
'learning_rate' : hp.quniform('learning_rate', 0.025, 0.5, 0.025),
'max_depth' : hp.choice('max_depth', np.arange(2, 8, dtype=int)),
'min_child_weight' : hp.quniform('min_child_weight', 1, 6, 1),
'subsample' : hp.quniform('subsample', 0.3, 0.8, 0.05),
'gamma' : hp.quniform('gamma', 0.5, 1, 0.05),
'colsample_bytree' : hp.quniform('colsample_bytree', 0.3, 0.8, 0.05),
'objective': 'binary:logistic',
'nthread' : 7,
'silent' : 1,
'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
'seed': seed
}
best = fmin(score, space, algo=tpe.suggest, trials=trials, max_evals=500)
print best
trials = Trials()
optimize(trials)
Set the new, optimized default settings, test on hold-out sample our optimization is not made worse.
In [73]:
params = {
'objective': 'binary:logistic',
'colsample_bytree': 0.55,
'learning_rate': 0.1,
'min_child_weight': 6.0,
'n_estimators': 59,
'subsample': 0.65,
'max_depth': 2,
'gamma': 0.95,
'scale_pos_weight':12.44,
'seed':seed
}
gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train, y_train, eval_metric='auc')
predict_prb = gtree.predict_proba(X_valid)[:, 1]
print 'Hold-out set, roc-auc: ', roc_auc_score(y_valid, predict_prb)
As you can see to find the optimal settings with hyperopt on 300 iterations failed. You may want to increase the number of iterations to narrow down the number of parameters, but this time we are not going to do it
Let's look at the objects. On what objects achieved the highest classification error? Are these objects have something in common? Can you see any patterns? Why the largest error is achieved at these facilities? In this case, "largest" error can be understood as the classification of an object with someone else's class with great certainty (high probability).
In [75]:
gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train, y_train, eval_metric='auc')
predict_prb = gtree.predict_proba(X_valid)[:, 1]
df_new = pd.concat([X_valid])
df_new['target'] = y_valid
df_new['prediction'] = pd.Series(gtree.predict(X_valid), index=X_valid.index)
df_new['prediction_proba'] = pd.Series(predict_prb, index=X_valid.index)
df_new.head()
Out[75]:
Find the indices of objects which do not match the target label and the predicted label
In [840]:
not_same=df_new[((df_new.target == 1) & (df_new.prediction == 0)) | ((df_new.target == 0) & (df_new.prediction == 1))]
Investigate the behavior of the most important numerical features which have selected by the model. For this visualize the distribution of feature values. Blanks will be marked as it is noticeable, for example, by number -70
In [841]:
new_df, _ = get_fresh_data()
train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]
num_features = ['Var126','Var113','Var81','Var73','Var189','Var74','Var133','Var6',
'Var57','Var65','Var28','Var94','Var140','Var163','Var13','Var153',
'Var83','Var119','Var123','Var38']
train_data[num_features] = train_data[num_features].fillna(-70)
fig, axes = plt.subplots(5, 4, figsize=(20, 20))
for i in range(len(num_features)):
xx = train_data.ix[not_same.index][num_features[i]]
sns.distplot(xx, kde=False,ax=axes[i / 4, i % 4], color = 'r', label='0')
axes[i / 4, i % 4].set(xlabel=num_features[i])
fig.tight_layout()
Conclusions:
As can be seen from the distributions, the objects on which mistakes were made combine two things: 1. They have a lot of blanks 2. There are large emissions near zero.
Restoration of missing data:
It seems to me, and it was repeatedly confirmed in the course of research - good data is everything! The main problem I faced is the correct restoration of blanks and work with distortions in the data distributions. I have considered not all methods and I think we should try others, for example Bayesian methods.
Try other models:
I like XGBoost and that's why I chose it as a base model for building experiments. But this does not mean that other models are bad. My experiments with undersampling, oversampling, data recovery by class, gave weak results of the experiments but to push me to the idea that these techniques may work poorly on gradient boosting only. But may well prove themselves in the linear models.
Another example, which pushes me to think that you can try the linear model is an experiment with the binarize categorical features and pseudocategories numerical features. Such binarization works very well in LogisticRegression, but did not give a good gain in applying for XGBoost.
In my experiments, I didn't use ensembles of models. Very often, especially in competitions on Kaggle, the models with different predictive abilities collected into one ensemble can give a significant improve in prediction score.
Fine tuning of the model:
If you look at the histogram of the importance of the features by which XGBoost did the splits and the distribution of values in these features, it becomes obvious that XGBoost somewhere too adjusted to these imbalances and we have to help him to process such data. How? I haven't decided yet, XGBoost is an extremely powerful tool to build a Gradient Boosting, but I invested far too little time to study and understand the nuances of his work.
New features generation:
In my work I have generated new features mainly by binarizing category features and adding new feature as marker of blank value in the object. There are other approaches to the generation of new features, for example by using clustering, union of similar(correlated) features, features of frequency of the repetition of values in features and so on. All of above + the experiments with the different models can give an improvement of AUC.