Artem Zaika: KDD Cup 2009 - Customer relationship prediction

For the study data were taken from the competition KDD 2009 on Customer Relationship prediction

Task Description

The task is to estimate the churn of customers, hence there are two target values to be predicted. The challenge is staged in phases to test the rapidity with which each team is able to produce results. A large number of variables (15,000) is made available for prediction. However, to engage participants having access to less computing power, a smaller version of the dataset with only 230 variables will be made available in the second part of the challenge.

Churn (wikipedia definition): Churn rate is also sometimes called attrition rate. It is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks.

The small dataset will be made available at the end of the fast challenge. Both training and test sets contain 50,000 examples. The data are split similarly for the small and large versions, but the samples are ordered differently within the training and within the test sets. Both small and large datasets have numerical and categorical variables. For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical. For the small dataset, the first 190 variables are numerical and the last 40 are categorical. Toy target values are available only for practice purpose. The prediction of the toy target values will not be part of the final evaluation.

Evaluation

The performances are evaluated according to the arithmetic mean of the AUC for the hurn. This is what we call "Score".


In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import xgboost as xgb
from statsmodels.stats.weightstats import *
from statsmodels.stats.proportion import proportion_confint
from scipy import stats

from collections import defaultdict, Counter
import itertools
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
from sklearn.feature_selection import SelectKBest,RFECV, SelectFromModel
from sklearn.feature_selection import mutual_info_classif,  f_classif
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import validation_curve, learning_curve
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit, StratifiedKFold

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

seed = 123
np.random.seed(seed)

In [2]:
xgb.__version__


Out[2]:
'0.4'

Сonduct primary data analysis. Look at the data validity and omissions


In [3]:
df = pd.read_csv('orange_small_churn_train_data.csv', index_col='ID')
df.head()


Out[3]:
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 ... Var222 Var223 Var224 Var225 Var226 Var227 Var228 Var229 Var230 labels
ID
0 NaN NaN NaN NaN NaN 3052.0 NaN NaN NaN NaN ... vr93T2a LM8l689qOp NaN NaN fKCe 02N6s8f xwM2aC7IdeMC0 NaN NaN -1
1 NaN NaN NaN NaN NaN 1813.0 7.0 NaN NaN NaN ... 6hQ9lNX LM8l689qOp NaN ELof xb3V RAYp 55YFVY9 mj86 NaN -1
2 NaN NaN NaN NaN NaN 1953.0 7.0 NaN NaN NaN ... catzS2D LM8l689qOp NaN NaN FSa2 ZI9m ib5G6X1eUxUn6 mj86 NaN -1
3 NaN NaN NaN NaN NaN 1533.0 7.0 NaN NaN NaN ... e4lqvY0 LM8l689qOp NaN NaN xb3V RAYp F2FyR07IdsN7I NaN NaN 1
4 NaN NaN NaN NaN NaN 686.0 7.0 NaN NaN NaN ... MAz3HNj LM8l689qOp NaN NaN WqMG RAYp F2FyR07IdsN7I NaN NaN -1

5 rows × 231 columns

For convenience, replace the target variable with the label -1 (churn) on 0


In [4]:
df.loc[df['labels'] == -1, 'labels'] = 0

As we see in the data a lot of missing values. There are also features, which have missed all the values. When recovery permits, it is important to have a reliable allocation, but if the blanks in features of more than 60%, then I think such a distribution can not be trusted and means to restore such signs are useless, because they introduce only noise in the model

Remove features of having gaps in more than 65% of cases


In [5]:
def drop_empty(df):
    has_nan = df.isnull().sum()
    empty_cols = has_nan[has_nan >= len(df) * 0.65]
    to_drop = empty_cols.index.values.tolist()

    df.drop(to_drop, axis=1, inplace = True)
    return df

df = drop_empty(df)
df.head()


Out[5]:
Var6 Var7 Var13 Var21 Var22 Var24 Var25 Var28 Var35 Var38 ... Var220 Var221 Var222 Var223 Var225 Var226 Var227 Var228 Var229 labels
ID
0 3052.0 NaN NaN 480.0 600.0 20.0 480.0 200.00 0.0 82752.0 ... hLKtJ9p Al6ZaUT vr93T2a LM8l689qOp NaN fKCe 02N6s8f xwM2aC7IdeMC0 NaN 0
1 1813.0 7.0 636.0 212.0 265.0 2.0 128.0 166.56 0.0 2706120.0 ... hN8KpA1 oslk 6hQ9lNX LM8l689qOp ELof xb3V RAYp 55YFVY9 mj86 0
2 1953.0 7.0 448.0 176.0 220.0 0.0 72.0 311.76 0.0 4698780.0 ... 4UxGlow zCkv catzS2D LM8l689qOp NaN FSa2 ZI9m ib5G6X1eUxUn6 mj86 0
3 1533.0 7.0 4.0 332.0 415.0 0.0 144.0 220.08 5.0 864384.0 ... ylCK5YS oslk e4lqvY0 LM8l689qOp NaN xb3V RAYp F2FyR07IdsN7I NaN 1
4 686.0 7.0 0.0 160.0 200.0 2.0 48.0 278.00 0.0 4364880.0 ... PYpzAu9 oslk MAz3HNj LM8l689qOp NaN WqMG RAYp F2FyR07IdsN7I NaN 0

5 rows × 75 columns


In [6]:
data, labels = df.iloc[:, :-1], df.iloc[:,-1]

Calculate proporions of classes “churn” and “not churn”.


In [29]:
length = data.shape[0]
label_counts = labels.value_counts()

print 'Proporion of class “not churn”:', float(label_counts[0]) / length
print 'Proporion of class “churn”:', float(label_counts[1]) / length


Proporion of class “not churn”: 0.9256
Proporion of class “churn”: 0.0744

As we see, proportion of classes 1 is more than proportion of class 0, in almost 130 times. It's not good as for classification it is important that classes in the sample were balanced. Ie in our case the number of labels class 1 - 'not churn', the proportion must be equal to the number of tags of class 0 - 'churn'. If the balance is skewed toward larger class, models, especially linear, can be too retrained to detect class with higher proportion and ignore the small one.

Later on we will continue to conduct experiments to increase number of objects of smaller class and see how it will affect the model evaluation.

2. The primary analysis of the features, identifying patterns in distributions

There are two types of features in our sample- real and categorical. Separate them and, by default, fill in the blanks in the numerical features with zeros and categorical features with string - 'NaN'


In [8]:
num_features = data.select_dtypes(include=['float64', 'int64']).columns
cat_features = data.select_dtypes(include=['object']).columns

data[num_features] = data[num_features].fillna(0)
data[cat_features] = data[cat_features].fillna('NaN')

Evaluate the density distribution of the classes. This will allow first, to assess the type of distribution(this can be useful when restoring blanks), and the ratio of distributions of features for different classes.


In [9]:
fig, axes = plt.subplots(11, 4, figsize=(20, 30))
for i in xrange(len(num_features)):
    sns.kdeplot(data.loc[labels == 0, num_features[i]], color = 'r', ax=axes[i / 4, i % 4], label=0)
    sns.kdeplot(data.loc[labels == 1, num_features[i]], color = 'b', ax=axes[i / 4, i % 4], label=1)
    axes[i / 4, i % 4].set(xlabel=num_features[i])
fig.tight_layout()


So, what can we say about the densities of the distributions? They do not belong to a normal distribution and some distribution, for example Var189, Var73, Var126 clearly suggests that the distribution of classes 0 and 1 for these features are different, so could correlate differently with the target label

Why I started talking about the correlation between the features and target label. The thing is that in the calculation of dependencies of target labels from features, it is highly desirable to do so, so the features are poorly correlated among themselves, but were highly correlated with the target label. If features are not highly correlated with the target label, it is indicative of their usefulness in prediction. If the features correlate strongly with each other, it suggests redundancy and can negative affect the quality of prediction. Thus, highly correlated features can be simplified or somehow express each other and it can have a positive effect in forecasting.

2.1 Numerical features

Because the target variable is binary (classes do not churn/churn), will not fit the calculation of the Pearson correlation (or any other), so calculate the difference of mathematical expectation of feature for both classes of the target variable. Distribution of numerical features can be very different (see below), so as a measure of relatedness between the target variable and the feature will be used the significance level of the hypothesis that the characteristic distribution for both classes (the target variable) are the same. Using the Mann-Whitney test get that the mathematical expectation of 24 features for both classes are different at significance level 0.01:


In [10]:
# check whether average is significantly different for both classes in each feature
pvals = []
for i, col_name in enumerate(num_features):
    class0 = data.loc[labels == 0, col_name]
    class1 = data.loc[labels == 1, col_name]
    
    pvals.append(stats.mannwhitneyu(class0, class1)[1])

pvals_df = pd.DataFrame({'p_value': pvals}, index=num_features).sort_values('p_value', ascending=True)

# the number of variables that have statistically significantly different average
sum(pvals_df['p_value'] < 0.01)


Out[10]:
24

In [91]:
top20 = pvals_df[:20].index.tolist()
pvals_df.head()


Out[91]:
p_value
Var74 1.712714e-48
Var13 3.932649e-36
Var73 5.239055e-33
Var7 1.029381e-31
Var140 1.835457e-31

Graphs of the density distributions of the first 8 most different features(the weakest correlated) depending on the target variable:


In [92]:
fig, axes = plt.subplots(2, 4, figsize=(16, 6))
for i in range(8):
    sns.kdeplot(data.loc[labels == 0, pvals_df.index[i]], color = 'r', ax=axes[i / 4, i % 4], label=0)
    sns.kdeplot(data.loc[labels == 1, pvals_df.index[i]], color = 'b', ax=axes[i / 4, i % 4], label=1)
    axes[i / 4, i % 4].set(xlabel=pvals_df.index[i])
fig.tight_layout()


Numeric features have a lot of blanks, so the lack of values in a numeric variable can also be a good marker-feature for the model. For example, one of the previous feature, Var126, we compare the proportion of positive class for the whole sample and among the missing values of the feature Var126:


In [27]:
print "Proportion of class churn", float(label_counts[1]) / length

print "Proportion of class churn for missed values of feature Var126:", np.round(labels[df.Var126.isnull()].mean(), 3)


Proportion of class churn 0.0744
Proportion of class churn for missed values of feature Var126: 0.113

To be sure that the proportions differ significantly, considering a confidence interval of Wilson for the fraction of missing values of the characteristic Var126:


In [28]:
proportion = labels[df.Var126.isnull()].mean() 
n = sum(df.Var126.isnull())
print "Confidence interval:", np.round(proportion_confint(proportion*n, n, alpha = 0.05, method="wilson"), 4)


Confidence interval: [ 0.1068  0.1185]

2.2 Categorical features

Categorical variables are very different, some of them has 10 to >13 thousand unique values. To check which of the variables can be useful for us let's calculate for each unique value of a categorical feature the proportion of positive class, and construct a confidence interval for this proportion. If the proportion of the original dataset is missing in the interval, then this feature (with these unique values) may be useful for the model.


In [95]:
# the Cramer's correlation coefficient for categorical variables
# http://stackoverflow.com/questions/20892799/using-pandas-calculate-cramérs-coefficient-matrix
def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma and Wicher, 
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = stats.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))


# wrapper for the function above
def get_cramers_cor(var1, var2):
    if var1.equals(var2):
        return 1.0
    confusion_matrix = pd.crosstab(var1, var2)
    return cramers_corrected_stat(np.array(confusion_matrix))

# build Wilson's confidence interval and check whether basline belongs to it
def in_conf_int(baseline, proportion, n, alpha = 0.05):
    conf_int = proportion_confint(proportion*n, n, alpha = alpha, method="wilson")
    if baseline > conf_int[0] and baseline < conf_int[1]:
        return True
    return False

In [96]:
# calculate correlations
conf_matrix_cor = pd.DataFrame(1.0, index=cat_features, columns=cat_features)
for feat in itertools.combinations(cat_features, 2):
    conf_matrix_cor.loc[feat[0], feat[1]] = get_cramers_cor(data[feat[0]], data[feat[1]])
    conf_matrix_cor.loc[feat[1], feat[0]] = conf_matrix_cor.loc[feat[0], feat[1]]

# triangle mask
mask = np.zeros_like(conf_matrix_cor, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# plot params
f, ax = plt.subplots(figsize=(12, 12))

# palette
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# heat-map drawing
sns.heatmap(conf_matrix_cor, mask=mask, cmap=cmap, vmax=1.0, vmin=0.0,
            square=True, xticklabels=True, yticklabels=True,
            linewidths=.1, cbar_kws={"shrink": .5}, ax=ax)
plt.show()


Conclusions: So, we conducted primary data analysis and identified features and categorially features that can have the greatest impact on prognosis. Such analysis can be very useful when working with linear models, but when working with Trees, RandomForest and Gradient Boosting models, it may not be very useful, because these models in the preview are not sensitive to distortions in classes, and secondly use a different approach in assessing the importance of features(the so-called 'information gain').

3. Working with features, processing of missing values

Hereafter, I will use the model gradient boosting - XGBoost. It will have some influence on the data, such as:

  • Gradienty boosting is not sensitive to the difference in scale of feature values. That is, the feature can contain the values 1 and 1000 and they do not need to be scaled, as in the case of linear models.
  • Gradienty boosting in its own way determines which features to use to make the split. Sometimes these features are not very good and you need to help boosting to find useful features to split, such as adding new feature showing if feature has blanks. For categorical features - adding new feature, coded by categories frequency

Make some utility functions for visualization of cross-validation results of XGBoost.


In [32]:
# Graph results sklearn.model_selection.learning_curve
def plot_curves(train_scores, test_scores, train_sizes, x_text="num of samples"):    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fig = plt.figure(figsize=(10, 6), dpi=100)

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.xlabel(x_text)
    plt.ylabel("score")
    plt.legend(loc="best")
    plt.show()
    
    i = np.argmax(test_scores_mean)
    print("Best cross-validation result ({0:.2f}) obtained for {1} ".format(test_scores_mean[i], train_sizes[i]))
    
# visualize score depending on number of objects 
def train_learning_curve(X_train, y_train, default_params=None):
    scale_pos_weight = float(sum(y_train == 0))/sum(y_train == 1)
        
    if default_params is None:
        default_params = {
            'objective': 'binary:logistic',
            'max_depth': 2,
            'learning_rate': 0.3,
            'silent': 1.0,
            'scale_pos_weight':scale_pos_weight,
            'n_estimators':23,
            'seed': seed
            }
        
    default_params['scale_pos_weight'] = scale_pos_weight
    train_sizes, train_scores, valid_scores = learning_curve(
        xgb.XGBClassifier(**default_params), 
        X_train, y_train,        
        scoring='roc_auc', 
        cv=StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=seed), n_jobs=-1)
    
    
    plot_curves(train_scores, valid_scores, train_sizes) 
    
# Score - number of algorithms dependency visualizaion    
def train_validation_curve(X_train, y_train, est_end=200, default_params=None):
    scale_pos_weight = float(sum(y_train == 0))/sum(y_train == 1)
    
    if default_params is None:
        default_params = {
            'objective': 'binary:logistic',
            'max_depth': 2,
            'learning_rate': 0.3,
            'silent': 1.0,
            'scale_pos_weight':scale_pos_weight,
            'seed': seed
            }
        
    default_params['scale_pos_weight'] = scale_pos_weight    
    n_estimators_range = np.linspace(1, est_end, 10).astype('int')

    train_scores, test_scores = validation_curve(
        xgb.XGBClassifier(**default_params),
        X_train, y_train,
        param_name = 'n_estimators',
        param_range = n_estimators_range,
        cv=5,
        scoring='roc_auc', n_jobs=-1)
    
    plot_curves(train_scores, test_scores, n_estimators_range, x_text="num of trees")
    
# Cross-validation
def cross_validation(train, labels, default_params=None, num_rounds=23): # how many estimators 
    scale_pos_weight = float(sum(labels == 0))/sum(labels == 1)
    
    if default_params is None:
        default_params = {'objective': 'binary:logistic',
                  'max_depth': 2,
                  'learning_rate': 0.3,
                  'seed': seed}

    default_params['scale_pos_weight'] = scale_pos_weight    
    dtrain = xgb.DMatrix(data=train, label=labels, silent=False, feature_names=train.columns)
    cv_results = xgb.cv(default_params, 
              dtrain, 
              num_boost_round=num_rounds, 
              nfold=5, 
              metrics=['auc'], 
#               shuffle=True, 
#               stratified=True,
              early_stopping_rounds=5,
              seed=seed)

    return cv_results

It's time to create a delayed sample. It will be a wet towel on your face, when success hit the head due to the fact that you(or rather I) will obtain estimates for roc_auc=1.0.

There is one more thing that we need to consider. Models in Machine Learning can only work with numeric values, but not with text data. So we need to convert categorical features to numerical. The basic approach - we will encode all the blanks in the physical signs to 0, in the categorical - 'NaN' and range them using LabelEncoder().


In [30]:
data[cat_features] = data[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))

X_train, X_hold_out, y_train, y_hold_out = train_test_split(data, labels, test_size=0.25, 
                                                     random_state=seed, stratify=labels) 

print 'Any missing values? ', X_hold_out.isnull().values.any()


Any missing values?  False

In [31]:
X_train.head()


Out[31]:
Var6 Var7 Var13 Var21 Var22 Var24 Var25 Var28 Var35 Var38 ... Var219 Var220 Var221 Var222 Var223 Var225 Var226 Var227 Var228 Var229
ID
22182 1106.0 7.0 92.0 168.0 210.0 0.0 48.0 286.96 0.0 3698856.0 ... 11 498 4 1781 0 1 4 2 8 1
11748 630.0 0.0 0.0 544.0 680.0 0.0 264.0 253.52 0.0 462420.0 ... 11 516 4 2402 0 1 11 2 8 0
1213 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 ... 11 1569 4 704 0 1 18 2 8 0
21427 385.0 21.0 4816.0 144.0 180.0 0.0 8.0 240.56 0.0 2019852.0 ... 11 2004 4 3519 0 1 16 2 2 2
33828 763.0 7.0 10072.0 136.0 170.0 0.0 24.0 310.32 0.0 4883400.0 ... 11 3352 4 2502 0 2 2 2 14 2

5 rows × 74 columns

The data is normal view now, let's first find the desired number of algorithms in xgboost. This is necessary in order not to waste time building unnecessary algorithms and thus to speed up the model training. Put the maximum range is 100. Leave other parameters by default.


In [123]:
default_params = {
            'objective': 'binary:logistic',
            'max_depth': 2,
            'learning_rate': 0.3,
            'silent': 1.0,            
            'seed': seed
}

train_validation_curve(X_train, y_train, est_end=100, default_params=default_params)


Best cross-validation result (0.74) obtained for 23 

As we can see, the highest rating obtained in the number of trees 23. The following tests will use the rounded number of algorithms to 24


In [37]:
default_params = {'objective': 'binary:logistic',
                  'max_depth': 2,
                  'learning_rate': 0.3,
                  'silent': 1.0,
                  'n_estimators': 24,
                  'seed': seed }

How to interpret the behavior of the curves in the graphs? Comes to the rescue of the information from the sklearn package: http://scikit-learn.org/stable/modules/learning_curve.html#learning-curve

Description of validation-curve

If the training score and the validation score are both low, the estimator will be underfitting. If the training score is high and the validation score is low, the estimator is overfitting and otherwise it is working very well. A low training score and a high validation score is usually not possible

Description of training-curve

A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. If both the validation score and the training score converge to a value that is too low with increasing size of the training set, we will not benefit much from more training data. In the following plot you can see an example: naive Bayes roughly converges to a low score.

We will probably have to use an estimator or a parametrization of the current estimator that can learn more complex concepts (i.e. has a lower bias). If the training score is much greater than the validation score for the maximum number of training samples, adding more training samples will most likely increase generalization.

To sum up, my goal of visual analysis is to make sure that the curves were on the same level, but did not go far up, because it is a signal about retraining

3.1 Evaluation of the quality of the model depending on the number of objects

Let's start simple. Let's estimate how many objects do we need to build quality models. Training is available for a fairly large sample and it may be that at some point the growth of the size of the training sample ceases to affect the quality of the model. Construct learning curves, training the model on samples of different sizes starting with a small number of objects in the training set and gradually increasing its size incrementally.

Take 30% of the original sample

In [127]:
_X_train, _, _y_train, __ = train_test_split(X_train, y_train, stratify=y_train, train_size=0.3, random_state=seed)

train_learning_curve(_X_train, _y_train, default_params=default_params)


Best cross-validation result (0.71) obtained for 7200 

Compare with results of cross-validation. The bottom line is different, but the results are quite acceptable


In [128]:
cv_results = cross_validation(_X_train, _y_train, num_rounds=23)
cv_results.tail(5)


Will train until cv error hasn't decreased in 5 rounds.
Stopping. Best iteration:
[7] cv-mean:0.7163726	cv-std:0.017247193541
Out[128]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
3 0.692284 0.017534 0.727096 0.005797
4 0.702634 0.016866 0.740776 0.005643
5 0.711694 0.019528 0.751302 0.002502
6 0.715813 0.020229 0.757329 0.002340
7 0.716373 0.017247 0.762519 0.002973
Take 50% of the original sample

In [129]:
_X_train, _, _y_train, __ = train_test_split(X_train, y_train, stratify=y_train, train_size=0.5, random_state=seed)

train_learning_curve(_X_train, _y_train, default_params=default_params)


Best cross-validation result (0.74) obtained for 12000 

Compare with results of cross-validation.


In [131]:
cv_results = cross_validation(_X_train, _y_train, num_rounds=23)
cv_results.tail(5)


Will train until cv error hasn't decreased in 5 rounds.
Stopping. Best iteration:
[11] cv-mean:0.7382426	cv-std:0.0111096279434
Out[131]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
7 0.728409 0.009175 0.755931 0.003308
8 0.730161 0.012515 0.758456 0.001953
9 0.734516 0.011563 0.760973 0.002775
10 0.736371 0.010152 0.763664 0.003440
11 0.738243 0.011110 0.765145 0.002992
Take 80% of the original sample

In [132]:
_X_train, _, _y_train, __ = train_test_split(X_train, y_train, stratify=y_train, train_size=0.8, random_state=seed)

train_learning_curve(_X_train, _y_train, default_params=default_params)


Best cross-validation result (0.73) obtained for 19200 

Compare with results of cross-validation.


In [133]:
cv_results = cross_validation(_X_train, _y_train, num_rounds=23)
cv_results.tail(5)


Will train until cv error hasn't decreased in 5 rounds.
Stopping. Best iteration:
[16] cv-mean:0.735579	cv-std:0.013683695875
Out[133]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
12 0.734929 0.014709 0.754417 0.003734
13 0.735509 0.014280 0.756223 0.003895
14 0.734509 0.014422 0.757784 0.003743
15 0.734741 0.013284 0.759509 0.003343
16 0.735579 0.013684 0.760926 0.003233
Take 100% of the original sample

In [134]:
train_learning_curve(X_train, y_train, default_params=default_params)
print y_train.value_counts()


Best cross-validation result (0.73) obtained for 24000 
0    27768
1     2232
Name: labels, dtype: int64

Compare with results of cross-validation.


In [135]:
cv_results = cross_validation(X_train, y_train, num_rounds=23)
cv_results.tail(5)


Will train until cv error hasn't decreased in 5 rounds.
Out[135]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
18 0.738404 0.010040 0.759613 0.003995
19 0.738603 0.009488 0.760870 0.003812
20 0.739502 0.009892 0.762179 0.003802
21 0.740743 0.010233 0.763355 0.003825
22 0.740498 0.010019 0.764397 0.003947

So, if we increase the sample assessment roc_auc is growing. Arrange this assessment the test of cruel reality - check it on hold-out sample


In [36]:
params = {'objective': 'binary:logistic',
                  'max_depth': 2,
                  'learning_rate': 0.3,
                  'silent': 1,
                  'n_estimators': 23,
                  'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
                  'seed': seed }

gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train, y_train, eval_metric='auc')

print 'Hold-out set, roc-auc: ', roc_auc_score(y_hold_out, gtree.predict_proba(X_hold_out)[:, 1])


Hold-out set, roc-auc:  0.726616736833

3.2 The elimination of the imbalance of classes by increasing the number of objects of a smaller class

Often unbalanced classes samples lead to some problems while training the models. For instance model can fit perfectly the samples of biggest class and show good accuracy score, but recall score will be very poor.

Let's try to differentiate the sample, play with the distribution of objects into classes and conclude about how the class ratio affects the quality of the model.


In [34]:
def oversample_x(data, labels, labels_x, amount_percents=1.0):
    rnd_indices = np.random.randint(0, len(labels_x), len(labels_x)*amount_percents) 
    
    new_df = pd.concat([data, data.ix[labels_x[rnd_indices]]], ignore_index=True)    
    new_lb = pd.concat([labels, labels.ix[labels_x[rnd_indices]]], ignore_index=True)
    new_df = pd.concat([new_df, new_lb], axis=1)
    # Shuffle
    new_df = pd.concat([new_df[:1], new_df[1:].sample(frac=1)]).reset_index(drop=True)     
    return new_df.iloc[:, :-1], new_df.iloc[:,-1]

def oversample_1(data, labels, amount_percents):
    label_idx = labels[labels == 1].index.values
    return oversample_x(data, labels, label_idx, amount_percents)
    
def oversample_0(data, labels, amount_percents):
    label_idx = labels[labels == 0].index.values
    return oversample_x(data, labels, label_idx, amount_percents)

Increase 3 times the number of classes of label "1"


In [38]:
train_data_3, train_labels_3 = oversample_1(X_train, y_train, 3.0)

print 'Classes proportion\n', train_labels_3.value_counts()
train_learning_curve(train_data_3, train_labels_3, default_params=default_params)


Classes proportion
0    27768
1     8928
Name: labels, dtype: int64
Best cross-validation result (0.75) obtained for 29356 

Compare with results of cross-validation.


In [139]:
cv_results = cross_validation(train_data_3, train_labels_3, num_rounds=23)
cv_results.tail(5)


Will train until cv error hasn't decreased in 5 rounds.
Out[139]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
18 0.759443 0.007300 0.764671 0.000656
19 0.760227 0.006986 0.765460 0.001107
20 0.760787 0.006898 0.766162 0.001274
21 0.761417 0.006786 0.767002 0.001388
22 0.761910 0.006949 0.768039 0.001266

Increase 6 times the number of classes of label "1"


In [39]:
train_data_6, train_labels_6 = oversample_1(X_train, y_train, 6.0)

print 'Classes proportion\n', train_labels_6.value_counts()
train_learning_curve(train_data_6, train_labels_6, default_params=default_params)


Classes proportion
0    27768
1    15624
Name: labels, dtype: int64
Best cross-validation result (0.75) obtained for 34713 

Compare with results of cross-validation.


In [40]:
cv_results = cross_validation(train_data_6, train_labels_6, num_rounds=23)
cv_results.tail(5)


Will train until cv error hasn't decreased in 5 rounds.
Out[40]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
18 0.753811 0.008476 0.758837 0.000831
19 0.754428 0.008455 0.759629 0.001028
20 0.755277 0.008282 0.760924 0.001057
21 0.755780 0.008285 0.761712 0.001045
22 0.756787 0.008062 0.762790 0.001044

Raze the ratio of classes labeled "1" and "0"


In [41]:
train_data_11, train_labels_11 = oversample_1(X_train, y_train, 11.2)

print u'Classes proportion\n', train_labels_11.value_counts()
train_learning_curve(train_data_11, train_labels_11, default_params=default_params)


Classes proportion
0    27768
1    27230
Name: labels, dtype: int64
Best cross-validation result (0.75) obtained for 34098 

Compare with results of cross-validation.


In [42]:
cv_results = cross_validation(train_data_11, train_labels_11, num_rounds=23)
cv_results.tail(5)


Will train until cv error hasn't decreased in 5 rounds.
Out[42]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
18 0.754425 0.002227 0.757008 0.000846
19 0.755627 0.002036 0.758210 0.000750
20 0.756050 0.002195 0.758955 0.000589
21 0.756728 0.001851 0.759543 0.000653
22 0.757621 0.001969 0.760645 0.000541

Compare with results on hold-out dataset.


In [54]:
params = {'objective': 'binary:logistic',
                  'max_depth': 2,
                  'learning_rate': 0.3,
                  'silent': 1,
                  'n_estimators': 23,
                  'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
                  'seed': 123 }

train_data_11, train_labels_11 = oversample_1(X_train, y_train, 11.0)

gtree = xgb.XGBClassifier(**params)
gtree.fit(train_data_11, train_labels_11, eval_metric='auc')

print 'Hold-out set, roc-auc: ', roc_auc_score(y_hold_out, gtree.predict_proba(X_hold_out)[:, 1])


Hold-out set, roc-auc:  0.716084190667

Conclusions:

As can be seen from the graphs of the cross-validation and cross-validation, the more data and the more balanced the sample is, the higher the score roc_auc, but the score on the hold-out sample is not so optimistic and makes it clear that increasing the number of the same data of a smaller class may lead to overfitting. Why is this happening? Maybe it's the fact that I have enable the option of balancing classes in XGBoost and because of this, the saturation of the sample the same data leads to overfitting. Switching on and off 'scale_pos_weight' parameter of XGBoost confirms that assumption.

3.3 The elimination of the imbalance of classes by reducing the number of objects larger class

Apply undersampling technique to sampling: for this you need to remove number of objects of larger class from the sample so that the classes proportion has changed.

Looking ahead I will say that after a little research I found out that the model is most mistaken for the objects, which had a lot of blanks. Moreover, such objects are among the class of "0". I think it will be convenient to reduce the number of objects larger class where there is likelyhood to make a mistake when you restore blanks.


In [55]:
def get_fresh_data():    
    fresh_data = pd.read_csv('orange_small_churn_train_data.csv', index_col='ID')
    fresh_data.loc[fresh_data['labels'] == -1, 'labels'] = 0   
    return drop_empty(fresh_data)

def standart_fill_nans(data):
    df_new = data.copy(deep=True)
    num_features = df_new.select_dtypes(include=['float64', 'int64']).columns
    cat_features = df_new.select_dtypes(include=['object']).columns

    df_new[num_features] = df_new[num_features].fillna(0)
    df_new[cat_features] = df_new[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))
    return df_new

In [56]:
def undersample_0(amount_nans=50): 
    new_df = get_fresh_data()
    train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]
    
    labels_0_idx = train_labels[train_labels == 0].index
    to_drop_idxs = []
    
    for row in train_data.ix[labels_0_idx].iterrows():
        if row[1].isnull().sum() > amount_nans:
            to_drop_idxs.append(row[0])

    train_data, train_labels = train_data.drop(train_data.index[to_drop_idxs]), train_labels.drop(train_labels.index[to_drop_idxs])       
    return train_data, train_labels

Reduce the number of classes labeled "0" for objects with the number of blanks in the features of more than 40.


In [57]:
train_data_und_40, train_labels_und_40 = undersample_0(40)
train_data_und_40 = standart_fill_nans(train_data_und_40)

print u'Classes proportion\n', train_labels_und_40.value_counts()
train_learning_curve(train_data_und_40, train_labels_und_40, default_params=default_params)


Classes proportion
0    33195
1     2976
Name: labels, dtype: int64
Best cross-validation result (0.75) obtained for 22425 

Compare with results of cross-validation.


In [58]:
cv_results = cross_validation(train_data_und_40, train_labels_und_40, num_rounds=23)
cv_results.tail(5)


Will train until cv error hasn't decreased in 5 rounds.
Out[58]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
18 0.746587 0.007425 0.762132 0.001623
19 0.746736 0.007431 0.763265 0.001131
20 0.747840 0.006784 0.764294 0.001426
21 0.748618 0.006411 0.764855 0.001723
22 0.748127 0.006280 0.765758 0.001710

Reduce the number of classes labeled "0" for objects with the number of blanks in the features of more than 10.


In [59]:
train_data_und_10, train_labels_und_10 = undersample_0(10)
train_data_und_10 = standart_fill_nans(train_data_und_10)

print u'Classes proportion\n', train_labels_und_10.value_counts()
train_learning_curve(train_data_und_10, train_labels_und_10, default_params=default_params)


Classes proportion
0    32195
1     2976
Name: labels, dtype: int64
Best cross-validation result (0.76) obtained for 21805 

Compare with results of cross-validation.


In [60]:
cv_results = cross_validation(train_data_und_10, train_labels_und_10, num_rounds=23)
cv_results.tail(5)


Will train until cv error hasn't decreased in 5 rounds.
Out[60]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
18 0.756499 0.008787 0.770654 0.001640
19 0.756922 0.009371 0.771690 0.001081
20 0.757258 0.009643 0.772568 0.001432
21 0.757905 0.009666 0.773301 0.001576
22 0.758529 0.009201 0.774593 0.001789

Compare with results on hold-out dataset.


In [169]:
params = {'objective': 'binary:logistic',
                  'max_depth': 2,
                  'learning_rate': 0.3,
                  'silent': 1,
                  'n_estimators': 23,
                  'seed': seed }

train_data_ho, train_labels_ho = undersample_0(10)
train_data_ho = standart_fill_nans(train_data_ho)

gtree = xgb.XGBClassifier(**params)
gtree.fit(train_data_ho, train_labels_ho, eval_metric='auc')

print 'Hold-out set, roc-auc: ', roc_auc_score(y_hold_out, gtree.predict_proba(X_hold_out)[:, 1])


Hold-out set, roc-auc:  0.673252412268

Conclusions:

Hold-out sample is put everything in its place. When reducing the number of blank objects, the greater the class quality decreases

3.4 Blanks-filling for object separately by class

Let's try to fill in blanks in the classes "0" and "1" separately. My hypothesis is that different classes have different distributions in features and to restore such distributions is necessary separately


In [170]:
features_sample = ['Var6', 'Var7', 'Var149', 'Var109']
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for i in xrange(4):
    sns.kdeplot(X_train.loc[y_train == 0, features_sample[i]], color = 'r', ax=axes[i], label=0)
    sns.kdeplot(X_train.loc[y_train == 1, features_sample[i]], color = 'b', ax=axes[i], label=1)
    axes[i].set(xlabel=features_sample[i])
fig.tight_layout()


So, the graphs of the density distribution for randomly chosen features can be seen that even for balanced sampling density can vary. Let us restore the data separately by classes

Fill-in the numerical features blanks by average values


In [174]:
new_df  = get_fresh_data()
train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]

data_label_0 = train_data[train_labels==0]
data_label_1 = train_data[train_labels==1]

train_data[train_labels==0] = data_label_0[num_features].fillna(data_label_0[num_features].mean())
train_data[train_labels==1] = data_label_1[num_features].fillna(data_label_1[num_features].mean())

train_data[cat_features] = train_data[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))

print 'Any values missing? ', train_data.isnull().values.any()
print 'Classes proportion\n', train_labels.value_counts()


Any values missing?  False
Class balance
0    37024
1     2976
Name: labels, dtype: int64

In [175]:
cv_results = cross_validation(train_data, train_labels, num_rounds=23)
cv_results.tail(5)


Will train until cv error hasn't decreased in 5 rounds.
Out[175]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
18 0.991810 0.001671 0.992377 0.000216
19 0.991796 0.001826 0.992519 0.000232
20 0.991969 0.001746 0.992694 0.000326
21 0.992225 0.001524 0.992955 0.000337
22 0.992254 0.001470 0.993027 0.000316

Scale out numerical features. On one hand, scaling is very useful when working with linear models, and when working with trees may not help, but let's look at the result

Scale-out by RobustScaler, which can help if values in features have large emissions.


In [178]:
new_df = get_fresh_data()
train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]

data_label_0 = train_data[train_labels==0]
data_label_1 = train_data[train_labels==1]

train_data[train_labels==0] = data_label_0[num_features].fillna(0).apply(RobustScaler().fit_transform)
train_data[train_labels==1] = data_label_1[num_features].fillna(0).apply(RobustScaler().fit_transform)

train_data[cat_features] = train_data[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))

print 'Any values missing? ', train_data.isnull().values.any()
print 'Classes proportion\n', train_labels.value_counts()


Any values missing?  False
Class balance
0    37024
1     2976
Name: labels, dtype: int64

Cross-validation check.


In [179]:
cv_results = cross_validation(train_data, train_labels, num_rounds=23)
cv_results.tail(5)


Will train until cv error hasn't decreased in 5 rounds.
Out[179]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
18 0.999819 0.000087 0.999876 0.000031
19 0.999860 0.000067 0.999896 0.000020
20 0.999865 0.000068 0.999904 0.000018
21 0.999881 0.000062 0.999917 0.000017
22 0.999880 0.000054 0.999920 0.000016

Scale-out by StandardScaler, transforming data to a standard distribution


In [181]:
new_df = get_fresh_data()
train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]

data_label_0 = train_data[train_labels==0]
data_label_1 = train_data[train_labels==1]

train_data[train_labels==0] = data_label_0[num_features].fillna(0).apply(StandardScaler().fit_transform)
train_data[train_labels==1] = data_label_1[num_features].fillna(0).apply(StandardScaler().fit_transform)

train_data[cat_features] = train_data[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))

print 'Any values missing? ', train_data.isnull().values.any()
print 'Classes proportion\n', train_labels.value_counts()


Any values missing?  False
Class balance
0    37024
1     2976
Name: labels, dtype: int64

Cross-validation check.


In [182]:
cv_results = cross_validation(train_data, train_labels, num_rounds=23)
cv_results.tail(5)


Will train until cv error hasn't decreased in 5 rounds.
Stopping. Best iteration:
[0] cv-mean:0.9994844	cv-std:0.000682405627175
Out[182]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
0 0.999484 0.000682 0.999832 0.000084

As we can see, each of the tested methods may lead to overfitting of the model. We assume that the recovery by class not working. Confirm this fact with testing on hold-out sample


In [209]:
new_df = get_fresh_data()
data, labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]

num_features = data.select_dtypes(include=['float64', 'int64']).columns
cat_features = data.select_dtypes(include=['object']).columns

data[cat_features] = data[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))

X_train, X_hold_out, y_train, y_hold_out = train_test_split(data, labels, test_size=0.25, 
                                                     random_state=seed, stratify=labels)

data_label_0 = X_train.loc[y_train==0, num_features]
data_label_1 = X_train.loc[y_train==1, num_features]

data_label_0 = data_label_0.fillna(data_label_0.mean())
data_label_1 = data_label_1.fillna(data_label_1.mean())

X_hold_out[num_features] = X_hold_out[num_features].fillna(X_hold_out[num_features].mean())

print 'Any values missing? ', X_hold_out.isnull().values.any()

params = {'objective': 'binary:logistic',
                  'max_depth': 2,
                  'learning_rate': 0.3,
                  'silent': 1,
                  'n_estimators': 23,
                  'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
                  'seed': seed }

gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train, y_train, eval_metric='auc')

print 'Hold-out set, roc-auc: ', roc_auc_score(y_hold_out, gtree.predict_proba(X_hold_out)[:, 1])


Any values missing?  False
Hold-out set, roc-auc:  0.719201755792

Conclusions:

Testing has shown that it is best to fill in the blanks with zeros. This is not surprising because if you look at the mode of our data, it is easy to see that the majority of numerical features the number with the highest frequency is number zero. However, the recovery features's blanks of the individual classes was not justified.

Scaling of numerical features has had little influence on the result.

3.5 Comparison of strategies for encoding categorical features

Before experiments with categorical features, let's look at their density distribution. This will help us to form hypotheses how to work with them and maybe it will become clear what categorical features better get rid of


In [225]:
fig, axes = plt.subplots(8, 4, figsize=(20, 30))
for i in xrange(len(cat_features)):
    sns.distplot(data.loc[labels == 0, cat_features[i]], bins=30, kde=False, ax=axes[i / 4, i % 4], label='0')
    sns.distplot(data.loc[labels == 1, cat_features[i]], bins=30, kde=False, ax=axes[i / 4, i % 4], label='1')
    axes[i / 4, i % 4].set(xlabel=cat_features[i])
fig.tight_layout()


Estimate values frequencies in categorical features.


In [232]:
df[cat_features].describe()


Out[232]:
Var192 Var193 Var195 Var196 Var197 Var198 Var199 Var200 Var202 Var203 ... Var219 Var220 Var221 Var222 Var223 Var225 Var226 Var227 Var228 Var229
count 39709 40000 40000 40000 39877 40000 39996 19634 39999 39877 ... 35804 40000 40000 40000 35804 19065 40000 40000 40000 17223
unique 354 50 23 4 220 3891 4400 13297 5542 5 ... 22 3891 7 3891 4 3 23 7 30 4
top 1GdOj1KXzC RO12 taul 1K8T 0Xwj fhk21Ss r83_sZi yP09M03 nyZz 9_Y1 ... FzaX 4UxGlow oslk catzS2D LM8l689qOp ELof FSa2 RAYp F2FyR07IdsN7I am7c
freq 313 28817 38353 39633 3688 3557 746 51 156 36192 ... 32191 3557 29610 3557 29279 8875 6403 28112 26211 9312

4 rows × 32 columns

As we can see the number of unique values in some categories more than a few thousand: Var198, Var200, Var220, Var222 and others. Such number of unique values is not very helpfull in calculating the class to which the object belongs to and looks more like noise to be rid of. Remove features with number of categories greater than 30.

Code categorially features by using binarization, we introduce a separate feature-marker for empty values


In [15]:
def get_cat_bin_data():
    new_df = get_fresh_data()
    train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]

    num_features = train_data.select_dtypes(include=['float64', 'int64']).columns
    #feature-marker for empty values
    num_features_na = list(map(lambda name: name+'_na', num_features))
    
    train_data[num_features_na] = train_data[num_features].apply(lambda x: pd.isnull(x).astype(int))
    train_data[num_features] = train_data[num_features].fillna(0)

    cat_features = train_data.select_dtypes(include=['object']).columns
    
    to_drop = []
    for cat_col in cat_features:
        uniques = len(train_data[cat_col].unique())
        if uniques > 30:
            to_drop.append(cat_col)

    train_data = train_data.drop(to_drop, axis=1)
    cat_features = [col for col in cat_features if col not in to_drop] 
    
    # Binarizing of cat.+ na
    dummies_cat = pd.get_dummies(train_data[cat_features], dummy_na=True)

    train_data = train_data.drop(cat_features, axis=1)    
    train_data = pd.concat([train_data, dummies_cat], ignore_index=False, axis=1)

    return train_data, train_labels

In [64]:
train_data, train_labels = get_cat_bin_data()

print 'New shape:', train_data.shape

X_train, X_valid, y_train, y_valid = train_test_split(train_data, train_labels, test_size=0.25, 
                                                     random_state=seed, stratify=train_labels) 

print 'Any missing values? ', X_train.isnull().values.any()
print 'Classes proportion\n', y_train.value_counts()

cv_results = cross_validation(X_train, y_train, num_rounds=23)
cv_results.tail(5)


 New shape: (40000, 284)
Any missing values?  False
Classes proportion
0    27768
1     2232
Name: labels, dtype: int64
Will train until cv error hasn't decreased in 5 rounds.
Out[64]:
test-auc-mean test-auc-std train-auc-mean train-auc-std
18 0.740105 0.011056 0.759684 0.003631
19 0.739834 0.010869 0.760454 0.003289
20 0.739806 0.010983 0.761435 0.003644
21 0.739880 0.010625 0.762803 0.003802
22 0.741283 0.011387 0.763909 0.003639

The results on the hold-out sample with dummy coding quite a bit has fallen in comparison with LabelEncoder, but this result still does not figure. Things can change after selection of the parameters and setting the model, because now all the experiments going on basline


In [65]:
params = {'objective': 'binary:logistic',
                  'max_depth': 2,
                  'learning_rate': 0.3,
                  'silent': 1,
                  'n_estimators': 23,
                  'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
                  'seed': seed }


gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train, y_train, eval_metric='auc')

print 'Hold-out set, roc-auc: ', roc_auc_score(y_valid, gtree.predict_proba(X_valid)[:, 1])


Hold-out set, roc-auc:  0.723218330917

4. Features analysis, selection of features

All features have been useful to build models? Conduct the procedure of selecting features, try different options of selection.

For features selecting I will use several approaches. The first is already built into XGBoost is the ability to look at the importance of the features by the number of times when he was involved in the split. Second, is the selection K-best of features by Mutual Information or the Fisher test. The third is a greedy selection of features based on the model LogisticRegression.

Selection by XGBoost


In [66]:
params = {'objective': 'binary:logistic',
                  'max_depth': 2,
                  'learning_rate': 0.3,
                  'silent': 1,
                  'n_estimators': 23,
                  'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
                  'eval_metric':'auc',
                  'seed': seed }

dtrain = xgb.DMatrix(data=X_train, label=y_train, silent=False, feature_names=X_train.columns)
dtest = xgb.DMatrix(data=X_valid, label=y_valid, silent=False, feature_names=X_valid.columns)

watchlist  = [(dtest,'test'), (dtrain,'train')] 
# bst = xgb.train(params, dtrain, 40, watchlist)
bst = xgb.train(params, dtrain, 40)

importances = bst.get_fscore()
importance_df = pd.DataFrame({
        'Splits': list(importances.values()),
        'Feature': list(importances.keys())
    })
importance_df.sort_values(by='Splits', inplace=True)
importance_df.plot(kind='barh', x='Feature', figsize=(8,10), color='orange')
print "\nOptimal number of features : %d" % len(importances.values())


Optimal number of features : 40

Memorize the most important signs, that will come in handy in the future


In [67]:
k_v = zip(importances.values(), importances.keys())
k_v.sort(key = lambda t: t[0], reverse=True)
kv_importances = [v[1] for v in k_v]
kv_importances


Out[67]:
['Var126',
 'Var113',
 'Var81',
 'Var189',
 'Var218_nan',
 'Var74',
 'Var210_g5HH',
 'Var73',
 'Var28',
 'Var6',
 'Var221_d0EEeJi',
 'Var163',
 'Var218_cJvF',
 'Var13',
 'Var153',
 'Var123',
 'Var205_sJzTlal',
 'Var76',
 'Var133',
 'Var140',
 'Var38',
 'Var57',
 'Var229_nan',
 'Var65',
 'Var112',
 'Var226_FSa2',
 'Var85',
 'Var219_FzaX',
 'Var83',
 'Var203_9_Y1',
 'Var221_JIiEFBU',
 'Var72',
 'Var228_F2FyR07IdsN7I',
 'Var160',
 'Var226_me1d',
 'Var125',
 'Var228_iyHGyLCEkQ',
 'Var144',
 'Var126_na',
 'Var228_d0LtHjWeaXyArdN4sxU_saXqH']

In [20]:
params = {'objective': 'binary:logistic',
                  'max_depth': 2,
                  'learning_rate': 0.3,
                  'silent': 1,
                  'n_estimators': 23,
                  'seed': seed }

gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train[kv_importances], y_train, eval_metric='auc')

print 'Hold-out set, roc-auc: ', roc_auc_score(y_valid, gtree.predict_proba(X_valid[kv_importances])[:, 1])


Hold-out set, roc-auc:  0.721969794658

K-best selection of features by Mutual Information or the Fisher test


In [489]:
params = {
'objective': 'binary:logistic', 
'max_depth': 2,
'learning_rate': 0.3,
'silent': True,
# 'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
'n_estimators':23,
'seed': seed
}


pipeline = Pipeline([
('selector', SelectKBest()),
('boost', xgb.XGBClassifier(**params))
])


parameters_grid = {
    'selector__score_func': [f_classif, mutual_info_classif], 
    'selector__k': [90, 100, 120]
}

grid_cv = GridSearchCV(pipeline, parameters_grid, scoring='roc_auc', n_jobs=-1, 
                                                    cv = StratifiedShuffleSplit(n_splits=5,test_size=0.25,random_state=seed))

grid_cv.fit(X_train, y_train)
print u'Best AUC score: ', grid_cv.best_score_
print u'Best params: ',grid_cv.best_params_


Best AUC score:  0.741320351215
Best params:  {'selector__k': 90, 'selector__score_func': <function f_classif at 0x7f1a08593668>}

Hold-out sample estimation


In [23]:
params = {'objective': 'binary:logistic',
                  'max_depth': 2,
                  'learning_rate': 0.3,
                  'silent': True,
                  'n_estimators': 23,
                  'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
                  'seed': seed }

kbest = SelectKBest(k=90, score_func=f_classif) 
X_train_kbest = kbest.fit_transform(X_train, y_train)
X_valid_kbest = kbest.transform(X_valid)

gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train_kbest, y_train, eval_metric='auc')

print 'Hold-out set, roc-auc: ', roc_auc_score(y_valid, gtree.predict_proba(X_valid_kbest)[:, 1])


Hold-out set, roc-auc:  0.724746836693

Finally, we will conduct a greedy selection of features with help of LogisticRegression


In [70]:
from sklearn.linear_model import LogisticRegression

X_train_scaled = X_train
X_train_scaled[num_features] = RobustScaler().fit_transform(X_train[num_features])

rfecv = RFECV(estimator=LogisticRegression(class_weight='balanced'), 
              step=1, 
              cv=StratifiedShuffleSplit(n_splits=5,test_size=0.3,random_state=seed),
              scoring='roc_auc', 
              n_jobs=-1)
rfecv.fit(X_train_scaled, y_train)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()

print("Optimal number of features : %d" % rfecv.n_features_)


Optimal number of features : 169

In [71]:
selected_idxes = np.where( rfecv.support_ == True )
rfe_selected = X_train.columns.values[selected_idxes]
print rfe_selected


['Var7' 'Var21' 'Var22' 'Var73' 'Var113' 'Var126' 'Var144' 'Var189'
 'Var6_na' 'Var24_na' 'Var28_na' 'Var38_na' 'Var65_na' 'Var81_na'
 'Var109_na' 'Var119_na' 'Var126_na' 'Var140_na' 'Var144_na' 'Var149_na'
 'Var189_na' 'Var195_ArtjQZ8ftr3NB' 'Var195_ArtjQZQO1r9fC'
 'Var195_ArtjQZmIvr94p' 'Var195_BNjsq81k1tWAYigY' 'Var195_CiJDdr4TQ0rGERIS'
 'Var195_CuXi4je' 'Var195_F1JQrEL' 'Var195_I9xt3GBDKUbd8'
 'Var195_I9xt3GDRhUK7p' 'Var195_I9xt3GMcxUnBZ' 'Var195_I9xt3Gi01UK7p'
 'Var195_LfvqpCtLOY' 'Var195_TnJpfvsJgF' 'Var195_XMIgoIlPqx'
 'Var195_ZZBPiZh' 'Var195_b_3Q' 'Var195_bsZtYxFjzA' 'Var195_ev6I'
 'Var195_hiMqnEM7VgIk4JUu' 'Var195_lSbpiq1' 'Var195_taul' 'Var196_1K8T'
 'Var196_JA1C' 'Var203_HLqf' 'Var203_dgxZ' 'Var203_nan' 'Var205_09_Q'
 'Var205_VpdQ' 'Var205_sJzTlal' 'Var206_409L' 'Var206_69fI' 'Var206_6JmL'
 'Var206_CoYW' 'Var206_G_zk' 'Var206_TIA9' 'Var206_giwq' 'Var206_hzlB'
 'Var206_itlM' 'Var206_lVqb' 'Var206_oZyB' 'Var206_sYC_' 'Var206_wMei'
 'Var206_y6dw' 'Var206_nan' 'Var207_0MCPoln' 'Var207_15TtzZrRt2'
 'Var207_5iay' 'Var207_6C53VA1kCv' 'Var207_DHn_WUyBhW_whjA88g9bvA64_'
 'Var207_EBKcR3s6B22tD6gC36gm6S' 'Var207_GjJ35utlTa_GNSvxxpb9ju'
 'Var207_Kxdu' 'Var207_NKv3VA1BpP' 'Var207_me75fM6ugJ'
 'Var207_o0ZZAVSQ32YuE' 'Var207_tMBVJkA0xJMEATvl4ht' 'Var207_wXfldy7'
 'Var208_kIsH' 'Var208_sBgB' 'Var208_nan' 'Var210_3av_' 'Var210_7A3j'
 'Var210_DM_V' 'Var210_g5HH' 'Var210_oT7d' 'Var210_uKAI' 'Var211_L84s'
 'Var211_Mtgm' 'Var218_UYBR' 'Var218_cJvF' 'Var218_nan' 'Var219_49W0rUY'
 'Var219_6krWwfF' 'Var219_AT1N' 'Var219_AU8KzzF' 'Var219_AU8OvAe'
 'Var219_AU8_WTd' 'Var219_AU8ltHK' 'Var219_AU8pNoi' 'Var219_FQHxeR8'
 'Var219_FqMWi1g' 'Var219_FzaX' 'Var219_JdAM' 'Var219_Lmli' 'Var219_OFWH'
 'Var219_kgEg' 'Var219_qxDb' 'Var219_wwPEXoilkr' 'Var219_ylgWTXl'
 'Var219_nan' 'Var221_Al6ZaUT' 'Var221_JIiEFBU' 'Var221_QKW8DRm'
 'Var221_d0EEeJi' 'Var221_oslk' 'Var221_z4pH' 'Var221_zCkv'
 'Var223_bCPvVye' 'Var223_nan' 'Var225_nan' 'Var226_5Acm' 'Var226_7FJQ'
 'Var226_7P5s' 'Var226_7aLG' 'Var226_Aoh3' 'Var226_FSa2' 'Var226_PM2D'
 'Var226_Qu4f' 'Var226_TNEC' 'Var226_Xa3G' 'Var226_me1d' 'Var226_rgKb'
 'Var226_szEZ' 'Var226_uWr3' 'Var226_w_Ub' 'Var226_xb3V' 'Var227_02N6s8f'
 'Var227_6fzt' 'Var227_RAYp' 'Var227_ZI9m' 'Var227_nIGjgSB'
 'Var227_vJ_w8kB' 'Var228_0pzBWGkV3fbsGZN52DH' 'Var228_4rd_'
 'Var228_55YFVY9' 'Var228_5oZS' 'Var228_6nxx00q6gnIF8ePJ3P3AEfZ1N'
 'Var228_9VmiOykV3fbsAg65e4w' 'Var228_DU8B' 'Var228_JnGLH58smNxIYcGK39k'
 'Var228_NoEd' 'Var228_R4y5gQQWY8OodqDV' 'Var228_RjBDiL8f9CQYc21fICscNNpj6'
 'Var228_SbOd7O8ky1wGNxp0Arj0Xs' 'Var228_VjDE' 'Var228_WfJ2BB2SFSqauljlfOB'
 'Var228_ZeaF' 'Var228_Zy3gnGM' 'Var228__URKn_77G3QyQHSVQ2N1RKmtW'
 'Var228__cTCyH95OE93jSkoIBT' 'Var228_b9qbUNk0dML_Mvi2'
 'Var228_d0LtHjWeaXyArdN4sxU_saXqH' 'Var228_iyHGyLCEkQ' 'Var228_r_7E'
 'Var228_xwM2aC7IdeMC0' 'Var229_oJmt' 'Var229_sk2h' 'Var229_nan']

In [74]:
params = {'objective': 'binary:logistic',
                  'max_depth': 2,
                  'learning_rate': 0.3,
                  'silent': True,
                  'n_estimators': 23,
                  'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),
                  'seed': seed }

gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train[rfe_selected], y_train, eval_metric='auc')

print 'Hold-out set, roc-auc: ', roc_auc_score(y_valid, gtree.predict_proba(X_valid[rfe_selected])[:, 1])


Hold-out set, roc-auc:  0.5685238462

Conclusions: So, different approaches have selected different number of features and showed different results on the hold-out sample. Greedy selection, 169, auc: 0.5685, selection by splits in the XGBoost - 40, auc: 0.7219. SelectKBest using Fisher 90, auc: 0.7247.

5. Model customizing

Select the optimal model parameters. Note that depending on how you processed the raw data, did the class balancing, how many objects are left in the training set, the optimal values of the parameters can be changed. Take the best of decisions at the moment, and will carry out the procedure of selection of parameters of the model using hyperopt.


In [26]:
from hyperopt import hp
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials


def score(params):
    xgb_tree = xgb.XGBClassifier(**params)
    
    eval_set = [(X_train, y_train), (X_valid, y_valid)] 

    xgb_tree.fit(X_train, y_train,
            eval_set=eval_set, eval_metric="auc", 
            early_stopping_rounds=30, verbose=False)

    pred = xgb_tree.predict_proba(X_valid)[:,1]
    auc = roc_auc_score(y_valid, pred)
    
    return{'loss':1-auc, 'status': STATUS_OK }

def optimize(trials):
    space = {
             'n_estimators' : hp.choice('n_estimators', np.arange(10, 100, dtype=int)),
             'learning_rate' : hp.quniform('learning_rate', 0.025, 0.5, 0.025),
             'max_depth' : hp.choice('max_depth', np.arange(2, 8, dtype=int)),
             'min_child_weight' : hp.quniform('min_child_weight', 1, 6, 1),
             'subsample' : hp.quniform('subsample', 0.3, 0.8, 0.05),
             'gamma' : hp.quniform('gamma', 0.5, 1, 0.05),
             'colsample_bytree' : hp.quniform('colsample_bytree', 0.3, 0.8, 0.05),
             'objective': 'binary:logistic',
             'nthread' : 7,
             'silent' : 1,
             'scale_pos_weight':float(sum(y_train == 0))/sum(y_train == 1),   
             'seed': seed   
             }

    best = fmin(score, space, algo=tpe.suggest, trials=trials, max_evals=500)

    print best

trials = Trials()
optimize(trials)


{'colsample_bytree': 0.55, 'learning_rate': 0.1, 'min_child_weight': 6.0, 'n_estimators': 59, 'subsample': 0.65, 'max_depth': 2, 'gamma': 0.9500000000000001}

Set the new, optimized default settings, test on hold-out sample our optimization is not made worse.


In [73]:
params = {
    'objective': 'binary:logistic',
    'colsample_bytree': 0.55, 
    'learning_rate': 0.1, 
    'min_child_weight': 6.0, 
    'n_estimators': 59, 
    'subsample': 0.65, 
    'max_depth': 2, 
    'gamma': 0.95,
    'scale_pos_weight':12.44,
    'seed':seed
}
gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train, y_train, eval_metric='auc')
predict_prb = gtree.predict_proba(X_valid)[:, 1]

print 'Hold-out set, roc-auc: ', roc_auc_score(y_valid, predict_prb)


Hold-out set, roc-auc:  0.618123466557

As you can see to find the optimal settings with hyperopt on 300 iterations failed. You may want to increase the number of iterations to narrow down the number of parameters, but this time we are not going to do it

5.1 Investigation of objects that achieved the highest classification error

Let's look at the objects. On what objects achieved the highest classification error? Are these objects have something in common? Can you see any patterns? Why the largest error is achieved at these facilities? In this case, "largest" error can be understood as the classification of an object with someone else's class with great certainty (high probability).


In [75]:
gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train, y_train, eval_metric='auc')
predict_prb = gtree.predict_proba(X_valid)[:, 1]

df_new = pd.concat([X_valid])

df_new['target'] = y_valid
df_new['prediction'] = pd.Series(gtree.predict(X_valid), index=X_valid.index)
df_new['prediction_proba'] = pd.Series(predict_prb, index=X_valid.index) 

df_new.head()


Out[75]:
Var6 Var7 Var13 Var21 Var22 Var24 Var25 Var28 Var35 Var38 ... Var228_xwM2aC7IdeMC0 Var228_nan Var229_am7c Var229_mj86 Var229_oJmt Var229_sk2h Var229_nan target prediction prediction_proba
ID
10619 868.0 7.0 396.0 88.0 110.0 2.0 72.0 166.56 0.0 0.0 ... 0 0 0 1 0 0 0 1 0 0.359385
24896 1155.0 7.0 2176.0 256.0 320.0 2.0 80.0 452.40 0.0 3063684.0 ... 0 0 0 0 0 0 1 0 0 0.310083
6597 189.0 0.0 0.0 20.0 25.0 0.0 0.0 186.64 0.0 722520.0 ... 0 0 0 0 0 0 1 0 0 0.446431
39086 875.0 0.0 0.0 64.0 80.0 2.0 8.0 286.96 5.0 8694960.0 ... 0 0 0 0 0 0 1 0 1 0.505361
3873 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 ... 0 0 0 0 0 0 1 0 0 0.133818

5 rows × 287 columns

Find the indices of objects which do not match the target label and the predicted label


In [840]:
not_same=df_new[((df_new.target == 1) & (df_new.prediction == 0)) | ((df_new.target == 0) & (df_new.prediction == 1))]

Investigate the behavior of the most important numerical features which have selected by the model. For this visualize the distribution of feature values. Blanks will be marked as it is noticeable, for example, by number -70


In [841]:
new_df, _ = get_fresh_data()
train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:,-1]

num_features = ['Var126','Var113','Var81','Var73','Var189','Var74','Var133','Var6',
                 'Var57','Var65','Var28','Var94','Var140','Var163','Var13','Var153',
                 'Var83','Var119','Var123','Var38']

train_data[num_features] = train_data[num_features].fillna(-70)

fig, axes = plt.subplots(5, 4, figsize=(20, 20))
for i in range(len(num_features)):
    xx = train_data.ix[not_same.index][num_features[i]]
    sns.distplot(xx, kde=False,ax=axes[i / 4, i % 4], color = 'r', label='0')
    axes[i / 4, i % 4].set(xlabel=num_features[i])
fig.tight_layout()


Conclusions:

As can be seen from the distributions, the objects on which mistakes were made combine two things: 1. They have a lot of blanks 2. There are large emissions near zero.

6. Future improvements

Restoration of missing data:

It seems to me, and it was repeatedly confirmed in the course of research - good data is everything! The main problem I faced is the correct restoration of blanks and work with distortions in the data distributions. I have considered not all methods and I think we should try others, for example Bayesian methods.

Try other models:

I like XGBoost and that's why I chose it as a base model for building experiments. But this does not mean that other models are bad. My experiments with undersampling, oversampling, data recovery by class, gave weak results of the experiments but to push me to the idea that these techniques may work poorly on gradient boosting only. But may well prove themselves in the linear models.

Another example, which pushes me to think that you can try the linear model is an experiment with the binarize categorical features and pseudocategories numerical features. Such binarization works very well in LogisticRegression, but did not give a good gain in applying for XGBoost.

In my experiments, I didn't use ensembles of models. Very often, especially in competitions on Kaggle, the models with different predictive abilities collected into one ensemble can give a significant improve in prediction score.

Fine tuning of the model:

If you look at the histogram of the importance of the features by which XGBoost did the splits and the distribution of values in these features, it becomes obvious that XGBoost somewhere too adjusted to these imbalances and we have to help him to process such data. How? I haven't decided yet, XGBoost is an extremely powerful tool to build a Gradient Boosting, but I invested far too little time to study and understand the nuances of his work.

New features generation:

In my work I have generated new features mainly by binarizing category features and adding new feature as marker of blank value in the object. There are other approaches to the generation of new features, for example by using clustering, union of similar(correlated) features, features of frequency of the repetition of values in features and so on. All of above + the experiments with the different models can give an improvement of AUC.