** Artem Zaika: KDD Cup 2009 - Customer relationship prediction **

The data for this study were taken from the KDD Cup 2009 competition on customer relationship prediction.

The task is to estimate customer churn, a binary target. The challenge is staged in phases to test the rapidity with which each team is able to produce results. A large number of variables (15,000) is made available for prediction. However, to engage participants having access to less computing power, a smaller version of the dataset with only 230 variables is made available in the second part of the challenge.

Churn (wikipedia definition): Churn rate is also sometimes called attrition rate. It is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks.

The small dataset will be made available at the end of the fast challenge. Both training and test sets contain 50,000 examples. The data are split similarly for the small and large versions, but the samples are ordered differently within the training and within the test sets. Both small and large datasets have numerical and categorical variables. For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical. For the small dataset, the first 190 variables are numerical and the last 40 are categorical. Toy target values are available only for practice purposes. The prediction of the toy target values will not be part of the final evaluation.
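
** For illustration, the positional split of the small dataset can be expressed directly. A minimal sketch: `orange_small_train.data` is the tab-separated file name used by the challenge (an assumption here; the local copy loaded below is named differently): **

```
import pandas as pd

X = pd.read_csv('orange_small_train.data', sep='\t')  # challenge file name (assumption)
num_cols = X.columns[:190]   # the first 190 variables are numerical
cat_cols = X.columns[190:]   # the last 40 are categorical
```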

Performance is evaluated as the area under the ROC curve (AUC) for the churn prediction. This is what we call the "Score".
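
** In scikit-learn terms the Score is simply `roc_auc_score` applied to the predicted churn probabilities. A toy illustration with made-up numbers: **

```
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]                # made-up churn labels
y_prob = [0.1, 0.4, 0.35, 0.8]       # made-up predicted churn probabilities
print roc_auc_score(y_true, y_prob)  # 0.75
```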

My work will be divided into several parts:

- 2. Primary analysis of the features, identifying patterns in the distributions
- 3. Working with the features, handling missing values
- 3.1 Evaluating model quality depending on the number of objects
- 3.2 Eliminating class imbalance by oversampling the minority class
- 3.3 Eliminating class imbalance by undersampling the majority class
- 3.4 Filling missing values separately for each class
- 3.5 Comparing strategies for encoding categorical features

- 4. Feature analysis and feature selection
- 5. Model tuning
- 6. Future improvements

```
In [1]:
```import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import xgboost as xgb
from statsmodels.stats.weightstats import *
from statsmodels.stats.proportion import proportion_confint
from scipy import stats
from collections import defaultdict, Counter
import itertools
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
from sklearn.feature_selection import SelectKBest,RFECV, SelectFromModel
from sklearn.feature_selection import mutual_info_classif, f_classif
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import validation_curve, learning_curve
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit, StratifiedKFold
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
seed = 123
np.random.seed(seed)

```
In [2]:
```xgb.__version__

```
Out[2]:
```

** Conduct primary data analysis: look at data validity and missing values **

```
In [3]:
```df = pd.read_csv('orange_small_churn_train_data.csv', index_col='ID')
df.head()

```
Out[3]:
```

** For convenience, replace the label -1 ('not churn') of the target variable with 0 **

```
In [4]:
```df.loc[df['labels'] == -1, 'labels'] = 0

** As we can see, the data contain a lot of missing values; some features are missing all of their values. Imputation requires a reliable distribution to draw from, but if a feature is missing in more than 60% of cases, I think its observed distribution cannot be trusted, and restoring such features is useless: they only introduce noise into the model. **

** Remove features where 65% or more of the values are missing **

```
In [5]:
```def drop_empty(df):
    # drop features where 65% or more of the values are missing
    has_nan = df.isnull().sum()
    empty_cols = has_nan[has_nan >= len(df) * 0.65]
    to_drop = empty_cols.index.values.tolist()
    df.drop(to_drop, axis=1, inplace=True)
    return df

df = drop_empty(df)
df.head()

```
Out[5]:
```

```
In [6]:
```data, labels = df.iloc[:, :-1], df.iloc[:,-1]

** Calculate proportions of the “churn” and “not churn” classes. **

```
In [29]:
```length = data.shape[0]
label_counts = labels.value_counts()
print 'Proportion of class “not churn”:', float(label_counts[0]) / length
print 'Proportion of class “churn”:', float(label_counts[1]) / length

```
```

** As we can see, class 0 ('not churn') outnumbers class 1 ('churn') by more than an order of magnitude (roughly 13:1). This is bad for classification: ideally the classes in the sample should be balanced, i.e. the number of objects labeled 1 ('churn') should be comparable to the number labeled 0 ('not churn'). If the balance is skewed toward the larger class, models, especially linear ones, can overfit to the majority class and ignore the small one.**

**Later on we will conduct experiments that increase the number of objects of the smaller class and see how this affects model evaluation.**

```
In [8]:
```num_features = data.select_dtypes(include=['float64', 'int64']).columns
cat_features = data.select_dtypes(include=['object']).columns
data[num_features] = data[num_features].fillna(0)
data[cat_features] = data[cat_features].fillna('NaN')

```
In [9]:
```fig, axes = plt.subplots(11, 4, figsize=(20, 30))
for i in xrange(len(num_features)):
    sns.kdeplot(data.loc[labels == 0, num_features[i]], color='r', ax=axes[i / 4, i % 4], label=0)
    sns.kdeplot(data.loc[labels == 1, num_features[i]], color='b', ax=axes[i / 4, i % 4], label=1)
    axes[i / 4, i % 4].set(xlabel=num_features[i])
fig.tight_layout()

```
```

```
In [10]:
```# check whether the two classes differ significantly in each numerical feature
# (Mann-Whitney U test; it makes no normality assumption about the distributions)
pvals = []
for i, col_name in enumerate(num_features):
    class0 = data.loc[labels == 0, col_name]
    class1 = data.loc[labels == 1, col_name]
    pvals.append(stats.mannwhitneyu(class0, class1)[1])
pvals_df = pd.DataFrame({'p_value': pvals}, index=num_features).sort_values('p_value', ascending=True)
# the number of features that differ statistically significantly between the classes
sum(pvals_df['p_value'] < 0.01)

```
Out[10]:
```

```
In [91]:
```top20 = pvals_df[:20].index.tolist()
pvals_df.head()

```
Out[91]:
```

```
In [92]:
```fig, axes = plt.subplots(2, 4, figsize=(16, 6))
for i in range(8):
    sns.kdeplot(data.loc[labels == 0, pvals_df.index[i]], color='r', ax=axes[i / 4, i % 4], label=0)
    sns.kdeplot(data.loc[labels == 1, pvals_df.index[i]], color='b', ax=axes[i / 4, i % 4], label=1)
    axes[i / 4, i % 4].set(xlabel=pvals_df.index[i])
fig.tight_layout()

```
```

```
In [27]:
```print "Proportion of class churn", float(label_counts[1]) / length
print "Proportion of class churn for missed values of feature Var126:", np.round(labels[df.Var126.isnull()].mean(), 3)

```
```

```
In [28]:
```proportion = labels[df.Var126.isnull()].mean()
n = sum(df.Var126.isnull())
print "Confidence interval:", np.round(proportion_confint(proportion*n, n, alpha = 0.05, method="wilson"), 4)

```
```

```
In [95]:
```# Cramér's V correlation coefficient for categorical variables
# http://stackoverflow.com/questions/20892799/using-pandas-calculate-cramérs-coefficient-matrix
def cramers_corrected_stat(confusion_matrix):
    """ Calculate Cramér's V statistic for categorical-categorical association.
        Uses the correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = stats.chi2_contingency(confusion_matrix)[0]
    n = float(confusion_matrix.sum())  # float, to avoid integer division below
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

# wrapper for the function above
def get_cramers_cor(var1, var2):
    if var1.equals(var2):
        return 1.0
    confusion_matrix = pd.crosstab(var1, var2)
    return cramers_corrected_stat(np.array(confusion_matrix))

# build Wilson's confidence interval and check whether the baseline belongs to it
def in_conf_int(baseline, proportion, n, alpha=0.05):
    conf_int = proportion_confint(proportion * n, n, alpha=alpha, method="wilson")
    return conf_int[0] < baseline < conf_int[1]

```
In [96]:
```# calculate pairwise correlations
conf_matrix_cor = pd.DataFrame(1.0, index=cat_features, columns=cat_features)
for feat in itertools.combinations(cat_features, 2):
    conf_matrix_cor.loc[feat[0], feat[1]] = get_cramers_cor(data[feat[0]], data[feat[1]])
    conf_matrix_cor.loc[feat[1], feat[0]] = conf_matrix_cor.loc[feat[0], feat[1]]
# mask for the upper triangle
mask = np.zeros_like(conf_matrix_cor, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# plot params
f, ax = plt.subplots(figsize=(12, 12))
# palette
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# draw the heat map
sns.heatmap(conf_matrix_cor, mask=mask, cmap=cmap, vmax=1.0, vmin=0.0,
            square=True, xticklabels=True, yticklabels=True,
            linewidths=.1, cbar_kws={"shrink": .5}, ax=ax)
plt.show()

```
```

**Conclusions:** We conducted a primary data analysis and identified the numerical and categorical features that may have the greatest impact on the prediction. Such analysis is very useful when working with linear models, but for tree-based models (decision trees, RandomForest and gradient boosting) it is less critical: firstly, these models are not as sensitive to skewed feature distributions, and secondly, they use a different approach to assessing the importance of features (so-called 'information gain').
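
** For reference, XGBoost can report this gain-based importance directly. A minimal sketch, assuming a fitted classifier such as the `gtree` model trained further below: **

```
# gain-based feature importance of a fitted XGBoost classifier
xgb.plot_importance(gtree, importance_type='gain', max_num_features=20)
plt.show()
```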

Hereafter, I will use a gradient boosting model, XGBoost. This choice affects how the data need to be prepared:

- Gradient boosting is not sensitive to differences in the scale of feature values. A feature may contain the values 1 and 1000 and does not need to be scaled, unlike with linear models.
- Gradient boosting decides on its own which features to use for splits. Sometimes those features are not very good, and we need to help boosting find useful splits, e.g. by adding a new binary feature indicating whether a value was missing, or, for categorical features, a new feature encoding each category by its frequency (see the sketch below).
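
** A minimal sketch of both helper-feature ideas (my own illustration, not part of the pipeline below; the helper names are arbitrary): **

```
# binary indicator per column: was the value missing?
def add_nan_indicators(df, columns):
    for col in columns:
        df[col + '_isnan'] = df[col].isnull().astype(int)
    return df

# encode a categorical column by the relative frequency of each category
def add_freq_encoding(df, columns):
    for col in columns:
        freq = df[col].value_counts(normalize=True)
        df[col + '_freq'] = df[col].map(freq)
    return df
```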

** Define some utility functions for visualizing XGBoost cross-validation results. **

```
In [32]:
```# plot the results of sklearn.model_selection.learning_curve / validation_curve
def plot_curves(train_scores, test_scores, train_sizes, x_text="num of samples"):
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fig = plt.figure(figsize=(10, 6), dpi=100)
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.xlabel(x_text)
    plt.ylabel("score")
    plt.legend(loc="best")
    plt.show()
    i = np.argmax(test_scores_mean)
    print("Best cross-validation result ({0:.2f}) obtained for {1}".format(test_scores_mean[i], train_sizes[i]))

# visualize the score depending on the number of objects
def train_learning_curve(X_train, y_train, default_params=None):
    scale_pos_weight = float(sum(y_train == 0)) / sum(y_train == 1)
    if default_params is None:
        default_params = {
            'objective': 'binary:logistic',
            'max_depth': 2,
            'learning_rate': 0.3,
            'silent': 1.0,
            'scale_pos_weight': scale_pos_weight,
            'n_estimators': 23,
            'seed': seed
        }
    default_params['scale_pos_weight'] = scale_pos_weight
    train_sizes, train_scores, valid_scores = learning_curve(
        xgb.XGBClassifier(**default_params),
        X_train, y_train,
        scoring='roc_auc',
        cv=StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=seed), n_jobs=-1)
    plot_curves(train_scores, valid_scores, train_sizes)

# visualize the score depending on the number of trees
def train_validation_curve(X_train, y_train, est_end=200, default_params=None):
    scale_pos_weight = float(sum(y_train == 0)) / sum(y_train == 1)
    if default_params is None:
        default_params = {
            'objective': 'binary:logistic',
            'max_depth': 2,
            'learning_rate': 0.3,
            'silent': 1.0,
            'scale_pos_weight': scale_pos_weight,
            'seed': seed
        }
    default_params['scale_pos_weight'] = scale_pos_weight
    n_estimators_range = np.linspace(1, est_end, 10).astype('int')
    train_scores, test_scores = validation_curve(
        xgb.XGBClassifier(**default_params),
        X_train, y_train,
        param_name='n_estimators',
        param_range=n_estimators_range,
        cv=5,
        scoring='roc_auc', n_jobs=-1)
    plot_curves(train_scores, test_scores, n_estimators_range, x_text="num of trees")

# cross-validation with xgb.cv
def cross_validation(train, labels, default_params=None, num_rounds=23):  # num_rounds: how many trees
    scale_pos_weight = float(sum(labels == 0)) / sum(labels == 1)
    if default_params is None:
        default_params = {'objective': 'binary:logistic',
                          'max_depth': 2,
                          'learning_rate': 0.3,
                          'seed': seed}
    default_params['scale_pos_weight'] = scale_pos_weight
    dtrain = xgb.DMatrix(data=train, label=labels, silent=False, feature_names=list(train.columns))
    cv_results = xgb.cv(default_params,
                        dtrain,
                        num_boost_round=num_rounds,
                        nfold=5,
                        metrics=['auc'],
                        # shuffle=True,
                        # stratified=True,
                        early_stopping_rounds=5,
                        seed=seed)
    return cv_results

```
In [30]:
```data[cat_features] = data[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))
X_train, X_hold_out, y_train, y_hold_out = train_test_split(data, labels, test_size=0.25,
random_state=seed, stratify=labels)
print 'Any missing values? ', X_hold_out.isnull().values.any()

```
```

```
In [31]:
```X_train.head()

```
Out[31]:
```

```
In [123]:
```default_params = {
    'objective': 'binary:logistic',
    'max_depth': 2,
    'learning_rate': 0.3,
    'silent': 1.0,
    'seed': seed
}
train_validation_curve(X_train, y_train, est_end=100, default_params=default_params)

```
```

```
In [37]:
```default_params = {'objective': 'binary:logistic',
                  'max_depth': 2,
                  'learning_rate': 0.3,
                  'silent': 1.0,
                  'n_estimators': 24,
                  'seed': seed}

** How should we interpret the behavior of the curves in the graphs? The sklearn documentation comes to the rescue: http://scikit-learn.org/stable/modules/learning_curve.html#learning-curve **

Description of the validation curve:

If the training score and the validation score are both low, the estimator will be underfitting. If the training score is high and the validation score is low, the estimator is overfitting; otherwise it is working very well. A low training score and a high validation score is usually not possible.

Description of the learning curve:

A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. If both the validation score and the training score converge to a value that is too low with increasing size of the training set, we will not benefit much from more training data.

In that case we will probably have to use an estimator or a parametrization of the current estimator that can learn more complex concepts (i.e. has a lower bias). If the training score is much greater than the validation score for the maximum number of training samples, adding more training samples will most likely increase generalization.

To sum up, the goal of my visual analysis is to make sure that the two curves converge to the same level; a training curve that sits far above the cross-validation curve is a signal of overfitting.

```
In [127]:
```_X_train, _, _y_train, __ = train_test_split(X_train, y_train, stratify=y_train, train_size=0.3, random_state=seed)
train_learning_curve(_X_train, _y_train, default_params=default_params)

```
```

```
In [128]:
```cv_results = cross_validation(_X_train, _y_train, num_rounds=23)
cv_results.tail(5)

```
Out[128]:
```

```
In [129]:
```_X_train, _, _y_train, __ = train_test_split(X_train, y_train, stratify=y_train, train_size=0.5, random_state=seed)
train_learning_curve(_X_train, _y_train, default_params=default_params)

```
```

** Compare with results of cross-validation.**

```
In [131]:
```cv_results = cross_validation(_X_train, _y_train, num_rounds=23)
cv_results.tail(5)

```
Out[131]:
```

```
In [132]:
```_X_train, _, _y_train, __ = train_test_split(X_train, y_train, stratify=y_train, train_size=0.8, random_state=seed)
train_learning_curve(_X_train, _y_train, default_params=default_params)

```
```

** Compare with results of cross-validation.**

```
In [133]:
```cv_results = cross_validation(_X_train, _y_train, num_rounds=23)
cv_results.tail(5)

```
Out[133]:
```

```
In [134]:
```train_learning_curve(X_train, y_train, default_params=default_params)
print y_train.value_counts()

```
```

** Compare with results of cross-validation.**

```
In [135]:
```cv_results = cross_validation(X_train, y_train, num_rounds=23)
cv_results.tail(5)

```
Out[135]:
```

```
In [36]:
```params = {'objective': 'binary:logistic',
          'max_depth': 2,
          'learning_rate': 0.3,
          'silent': 1,
          'n_estimators': 23,
          'scale_pos_weight': float(sum(y_train == 0)) / sum(y_train == 1),
          'seed': seed}
gtree = xgb.XGBClassifier(**params)
gtree.fit(X_train, y_train, eval_metric='auc')
print 'Hold-out set, roc-auc: ', roc_auc_score(y_hold_out, gtree.predict_proba(X_hold_out)[:, 1])

```
```

** Unbalanced class samples often lead to problems while training models: a model can fit the objects of the biggest class perfectly and show a good accuracy score, while its recall score stays very poor. **
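
** A quick way to see this effect is to compare accuracy and recall on the hold-out set (a sketch reusing the `gtree` model trained above): **

```
from sklearn.metrics import accuracy_score, recall_score

preds = gtree.predict(X_hold_out)
print 'Accuracy:', accuracy_score(y_hold_out, preds)
print 'Recall:  ', recall_score(y_hold_out, preds)
```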

** Let's vary the sample, play with the distribution of objects across the classes, and draw conclusions about how the class ratio affects model quality. **

```
In [34]:
```def oversample_x(data, labels, labels_x, amount_percents=1.0):
    # sample (with replacement) from the objects with the given index labels
    rnd_indices = np.random.randint(0, len(labels_x), int(len(labels_x) * amount_percents))
    new_df = pd.concat([data, data.loc[labels_x[rnd_indices]]], ignore_index=True)
    new_lb = pd.concat([labels, labels.loc[labels_x[rnd_indices]]], ignore_index=True)
    new_df = pd.concat([new_df, new_lb], axis=1)
    # shuffle
    new_df = pd.concat([new_df[:1], new_df[1:].sample(frac=1)]).reset_index(drop=True)
    return new_df.iloc[:, :-1], new_df.iloc[:, -1]

def oversample_1(data, labels, amount_percents):
    label_idx = labels[labels == 1].index.values
    return oversample_x(data, labels, label_idx, amount_percents)

def oversample_0(data, labels, amount_percents):
    label_idx = labels[labels == 0].index.values
    return oversample_x(data, labels, label_idx, amount_percents)

```
In [38]:
```train_data_3, train_labels_3 = oversample_1(X_train, y_train, 3.0)
print 'Classes proportion\n', train_labels_3.value_counts()
train_learning_curve(train_data_3, train_labels_3, default_params=default_params)

```
```

** Compare with results of cross-validation.**

```
In [139]:
```cv_results = cross_validation(train_data_3, train_labels_3, num_rounds=23)
cv_results.tail(5)

```
Out[139]:
```

```
In [39]:
```train_data_6, train_labels_6 = oversample_1(X_train, y_train, 6.0)
print 'Classes proportion\n', train_labels_6.value_counts()
train_learning_curve(train_data_6, train_labels_6, default_params=default_params)

```
```

** Compare with results of cross-validation.**

```
In [40]:
```cv_results = cross_validation(train_data_6, train_labels_6, num_rounds=23)
cv_results.tail(5)

```
Out[40]:
```

```
In [41]:
```train_data_11, train_labels_11 = oversample_1(X_train, y_train, 11.2)
print u'Classes proportion\n', train_labels_11.value_counts()
train_learning_curve(train_data_11, train_labels_11, default_params=default_params)

```
```

** Compare with results of cross-validation.**

```
In [42]:
```cv_results = cross_validation(train_data_11, train_labels_11, num_rounds=23)
cv_results.tail(5)

```
Out[42]:
```

** Compare with results on hold-out dataset. **

```
In [54]:
```params = {'objective': 'binary:logistic',
          'max_depth': 2,
          'learning_rate': 0.3,
          'silent': 1,
          'n_estimators': 23,
          'scale_pos_weight': float(sum(y_train == 0)) / sum(y_train == 1),
          'seed': 123}
train_data_11, train_labels_11 = oversample_1(X_train, y_train, 11.0)
gtree = xgb.XGBClassifier(**params)
gtree.fit(train_data_11, train_labels_11, eval_metric='auc')
print 'Hold-out set, roc-auc: ', roc_auc_score(y_hold_out, gtree.predict_proba(X_hold_out)[:, 1])

```
```

As can be seen from the learning curves and the cross-validation results, the more data we have and the more balanced the sample is, the higher the roc_auc score. The score on the hold-out sample, however, is not so optimistic and makes it clear that duplicating objects of the smaller class may lead to overfitting. Why does this happen? Perhaps because I have enabled the class-balancing option of XGBoost ('scale_pos_weight'), so saturating the sample with duplicated data tips it even further. Switching 'scale_pos_weight' on and off confirms that assumption; see the sketch below.
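
** A minimal sketch of that check (my own illustration; it reuses `params`, `train_data_11`, `train_labels_11` and the hold-out split defined above): **

```
# hold-out AUC with and without class balancing on the oversampled data
for spw in [1.0, float(sum(y_train == 0)) / sum(y_train == 1)]:
    params['scale_pos_weight'] = spw
    gtree = xgb.XGBClassifier(**params)
    gtree.fit(train_data_11, train_labels_11, eval_metric='auc')
    print 'scale_pos_weight =', round(spw, 1), '-> hold-out roc-auc:', \
        roc_auc_score(y_hold_out, gtree.predict_proba(X_hold_out)[:, 1])
```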

```
In [55]:
```def get_fresh_data():
    fresh_data = pd.read_csv('orange_small_churn_train_data.csv', index_col='ID')
    fresh_data.loc[fresh_data['labels'] == -1, 'labels'] = 0
    return drop_empty(fresh_data)

def standart_fill_nans(data):
    df_new = data.copy(deep=True)
    num_features = df_new.select_dtypes(include=['float64', 'int64']).columns
    cat_features = df_new.select_dtypes(include=['object']).columns
    df_new[num_features] = df_new[num_features].fillna(0)
    df_new[cat_features] = df_new[cat_features].fillna('NaN').apply(lambda x: LabelEncoder().fit_transform(x))
    return df_new

```
In [56]:
```def undersample_0(amount_nans=50):
    # drop 'not churn' objects that have more than amount_nans missing values
    new_df = get_fresh_data()
    train_data, train_labels = new_df.iloc[:, :-1], new_df.iloc[:, -1]
    labels_0_idx = train_labels[train_labels == 0].index
    to_drop_idxs = []
    for row in train_data.loc[labels_0_idx].iterrows():
        if row[1].isnull().sum() > amount_nans:
            to_drop_idxs.append(row[0])
    # row[0] is the index label, so drop by label
    train_data, train_labels = train_data.drop(to_drop_idxs), train_labels.drop(to_drop_idxs)
    return train_data, train_labels

```
In [57]:
```train_data_und_40, train_labels_und_40 = undersample_0(40)
train_data_und_40 = standart_fill_nans(train_data_und_40)
print u'Classes proportion\n', train_labels_und_40.value_counts()
train_learning_curve(train_data_und_40, train_labels_und_40, default_params=default_params)

```
```