In [1]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error

Gradient boosting guided example

Having walked through gradient boosting by hand, let's now try it with scikit-learn. We'll still use the European Social Survey data, but now with a categorical outcome: whether or not someone lives with a partner.


In [28]:
df = pd.read_csv((
    "https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
    "master/ESS_practice_data/ESSdata_Thinkful.csv")).dropna()

# Define outcome and predictors.
# Recode the outcome from {1, 2} to {0, 1}.
y = df['partner'] - 1
X = df.loc[:, ~df.columns.isin(['partner', 'cntry', 'idno'])]

# Make the categorical variable 'cntry' into country dummies.
X = pd.concat([X, pd.get_dummies(df['cntry'])], axis=1)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10)
# As originally designed, this dataset trained on non-Estonians and tested on Estonians,
# which is an imbalanced split; a random split (or cross-validation) should generalize better.
# Create training and test sets.
#offset = int(X.shape[0] * 0.9)

# Put 90% of the data in the training set.
#X_train, y_train = X[:offset], y[:offset]

# And put 10% in the test set.
#X_test, y_test = X[offset:], y[offset:]
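
The cross-validation idea mentioned above is easy to sketch. The block below is illustrative only: a small, untuned classifier scored with 5-fold cross-validation, not the model fit later in this notebook.


In [ ]:
# Rough 5-fold cross-validation as a sanity check on generalization (sketch only).
from sklearn.model_selection import cross_val_score

cv_clf = ensemble.GradientBoostingClassifier(n_estimators=100, max_depth=2)
cv_scores = cross_val_score(cv_clf, X, y, cv=5)
print('Cross-validated accuracy: {:.3f} (+/- {:.3f})'.format(
    cv_scores.mean(), cv_scores.std()))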

Since we're now working with a binary outcome, we've switched to a classifier, so our loss function can no longer be the squared residuals. Our options are 'deviance', which corresponds to logistic regression, and 'exponential', which recovers AdaBoost. The original model used 'deviance'; the tuned model below uses 'exponential', one of the changes made for the drill.
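
If we wanted to see how much the choice of loss matters before committing, a quick side-by-side comparison on the current split is one way to check. This is a minimal sketch; the small classifiers are illustrative, and note that newer scikit-learn releases rename 'deviance' to 'log_loss'.


In [ ]:
# Quick comparison of the two loss options on the existing train/test split (sketch only).
for loss in ['deviance', 'exponential']:
    candidate = ensemble.GradientBoostingClassifier(
        n_estimators=100, max_depth=2, loss=loss)
    candidate.fit(X_train, y_train)
    print('{}: test accuracy {:.3f}'.format(loss, candidate.score(X_test, y_test)))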


In [29]:
# 1000 iterations, 3-deep trees, a small learning rate, and the exponential loss.
params = {'n_estimators': 1000,
          'max_depth': 3,
          'loss': 'exponential',
          'learning_rate': 0.01}

# Initialize and fit the model.
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)

predict_train = clf.predict(X_train)
predict_test = clf.predict(X_test)

# Confusion tables (actual vs. predicted) for computing error rates.
table_train = pd.crosstab(y_train, predict_train, margins=True)
table_test = pd.crosstab(y_test, predict_test, margins=True)

train_tI_errors = table_train.loc[0.0, 1.0] / table_train.loc['All', 'All']
train_tII_errors = table_train.loc[1.0, 0.0] / table_train.loc['All', 'All']

test_tI_errors = table_test.loc[0.0, 1.0] / table_test.loc['All', 'All']
test_tII_errors = table_test.loc[1.0, 0.0] / table_test.loc['All', 'All']

print((
    'Training set accuracy:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}\n\n'
    'Test set accuracy:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}'
).format(train_tI_errors, train_tII_errors, test_tI_errors, test_tII_errors))


Training set accuracy:
Percent Type I errors: 0.04187124931805783
Percent Type II errors: 0.1853518821603928

Test set accuracy:
Percent Type I errors: 0.04539877300613497
Percent Type II errors: 0.2049079754601227

Unlike a single decision tree, a gradient boosting model is not terribly easy to interpret on the surface. But it isn't quite a black box either: we can measure how important each feature is by how much the splits on that feature reduce impurity, summed over every tree in the ensemble.


In [30]:
feature_importance = clf.feature_importances_

# Make importances relative to max importance.
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(8, 10))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()


It appears that age and happiness are the most important features in predicting whether or not someone lives with a partner.
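
Impurity-based importances like these can overstate features with many distinct values, so one optional cross-check is permutation importance on the held-out data. A minimal sketch, assuming a scikit-learn version (0.22 or later) that provides sklearn.inspection.permutation_importance:


In [ ]:
# Optional cross-check: permutation importance on the test set
# (sketch; requires scikit-learn >= 0.22).
from sklearn.inspection import permutation_importance

perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
for idx in np.argsort(perm.importances_mean)[-5:]:
    print('{}: {:.4f}'.format(X.columns[idx], perm.importances_mean[idx]))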

DRILL: Improve this gradient boosting model

While this model is already performing reasonably well, the Type I and Type II error rates show there is definitely room for improvement. Your task is to see how low you can push the error rates on the test set while fitting your model only on the training set (a tuning sketch follows below). Strategies you might use include:

  • Creating new features
  • Applying more overfitting-prevention strategies like subsampling
  • More iterations
  • Trying a different loss function
  • Changing the structure of the weak learner: Allowing more leaves in the tree, or other modifications

Have fun!
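
One way to explore several of these knobs at once is a small grid search over the training set. This is only a sketch: the parameter grid below is illustrative and is not the tuning that produced the final numbers reported next.


In [ ]:
# Sketch: small grid search over a few of the knobs listed above.
# The grid values are illustrative, not the settings behind the final model.
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [500, 1000],
              'max_depth': [2, 3],
              'subsample': [1.0, 0.5],
              'learning_rate': [0.01, 0.1]}

search = GridSearchCV(ensemble.GradientBoostingClassifier(),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print('Best parameters:', search.best_params_)
print('Best cross-validated accuracy: {:.3f}'.format(search.best_score_))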

Original error rates:

Training set accuracy:
Percent Type I errors: 0.04650845608292417
Percent Type II errors: 0.17607746863066012

Test set accuracy:
Percent Type I errors: 0.06257668711656442
Percent Type II errors: 0.18527607361963191

Final error rates:

Training set accuracy:
Percent Type I errors: 0.04187124931805783
Percent Type II errors: 0.1853518821603928

Test set accuracy:
Percent Type I errors: 0.04539877300613497
Percent Type II errors: 0.2049079754601227

Result: Type I errors on the test set dropped by roughly 1.7 percentage points (from about 0.063 to 0.045), while Type II errors rose slightly (from about 0.185 to 0.205).

Summary of changes: The max_depth, the number of iterations, and the loss function were changed. Additionally, a random train_test_split replaced the original ordered split, giving a more representative test sample.


In [ ]: