In [1]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
In [28]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
    "master/ESS_practice_data/ESSdata_Thinkful.csv").dropna()
# Define outcome and predictors.
# Set our outcome to 0 and 1.
y = df['partner'] - 1
X = df.loc[:, ~df.columns.isin(['partner', 'cntry', 'idno'])]
# Make the categorical variable 'country' into dummies.
X = pd.concat([X, pd.get_dummies(df['cntry'])], axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10)
# As originally designed, this dataset trains on non-Estonians and tests on Estonians in an imbalanced way.
# We may have better results doing a cross-validation.
# Create training and test sets.
#offset = int(X.shape[0] * 0.9)
# Put 90% of the data in the training set.
#X_train, y_train = X[:offset], y[:offset]
# And put 10% in the test set.
#X_test, y_test = X[offset:], y[offset:]
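Since the comment above suggests cross-validation might give better results, here is a minimal sketch using cross_val_score; the fold count and scoring metric are illustrative assumptions, not settings from the original analysis.
In [ ]:
# A minimal cross-validation sketch; cv=5 and accuracy scoring are
# illustrative choices, not part of the original notebook.
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

cv_clf = ensemble.GradientBoostingClassifier(n_estimators=500, max_depth=2)
cv_scores = cross_val_score(cv_clf, X, y, cv=5, scoring='accuracy')
print('Cross-validated accuracy: {:.3f} (+/- {:.3f})'.format(
    cv_scores.mean(), cv_scores.std() * 2))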
Since we're now working with a binary outcome, we've switched to a classifier, so our loss function can no longer be the raw residuals. Our options are "deviance" and "exponential". Deviance corresponds to logistic regression, while the exponential loss recovers AdaBoost-style boosting; after tuning, we use the exponential loss here.
In [29]:
# We'll run 1000 iterations with 3-deep trees, a small learning rate,
# and the exponential loss.
params = {'n_estimators': 1000,
          'max_depth': 3,
          'loss': 'exponential',
          'learning_rate': 0.01}
# Initialize and fit the model.
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)
predict_train = clf.predict(X_train)
predict_test = clf.predict(X_test)
# Error tables (confusion matrices with margins).
table_train = pd.crosstab(y_train, predict_train, margins=True)
table_test = pd.crosstab(y_test, predict_test, margins=True)

# Type I errors are false positives; Type II errors are false negatives.
train_tI_errors = table_train.loc[0.0, 1.0] / table_train.loc['All', 'All']
train_tII_errors = table_train.loc[1.0, 0.0] / table_train.loc['All', 'All']
test_tI_errors = table_test.loc[0.0, 1.0] / table_test.loc['All', 'All']
test_tII_errors = table_test.loc[1.0, 0.0] / table_test.loc['All', 'All']
print((
    'Training set accuracy:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}\n\n'
    'Test set accuracy:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}'
).format(train_tI_errors, train_tII_errors, test_tI_errors, test_tII_errors))
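As a cross-check, the same error rates can be read off scikit-learn's confusion_matrix; this sketch is equivalent to the crosstab approach above.
In [ ]:
# Equivalent error rates via confusion_matrix: for 0/1 labels, ravel()
# returns (tn, fp, fn, tp); fp is a Type I error, fn a Type II error.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, predict_test).ravel()
total = tn + fp + fn + tp
print('Percent Type I errors: {}'.format(fp / total))
print('Percent Type II errors: {}'.format(fn / total))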
Unlike single decision trees, gradient boosting solutions are not terribly easy to interpret on the surface. But they aren't quite a black box: we can get a measure of how important each feature is from how much it improves the splits, accumulated over the many decision trees in the ensemble (exposed as feature_importances_).
In [30]:
feature_importance = clf.feature_importances_
# Make importances relative to max importance.
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
It appears that age and happiness are the most important features in predicting whether or not someone lives with a partner.
While this model is already doing alright, we've seen from the Type I and Type II error rates that there is definitely room for improvement. Your task is to see how low you can get the error rates to go in the test set, based on your model in the training set. Strategies you might use include tuning the number of iterations, the tree depth, and the learning rate; trying a different loss function; and changing how the training and test sets are sampled (see the grid-search sketch below).
Have fun!
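One way to search those settings systematically is a small grid search. The sketch below uses illustrative grid values, not the tuned parameters reported further down.
In [ ]:
# A grid-search sketch over a few of the parameters noted above;
# the grid values here are illustrative, not the final tuned settings.
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [500, 1000],
              'max_depth': [2, 3],
              'loss': ['deviance', 'exponential']}
search = GridSearchCV(ensemble.GradientBoostingClassifier(),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_)
print('Best cross-validated accuracy: {:.3f}'.format(search.best_score_))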
Original error rates:
Training set accuracy:
Percent Type I errors: 0.04650845608292417
Percent Type II errors: 0.17607746863066012
Test set accuracy:
Percent Type I errors: 0.06257668711656442
Percent Type II errors: 0.18527607361963191
Final error rates:
Training set accuracy:
Percent Type I errors: 0.04187124931805783
Percent Type II errors: 0.1853518821603928
Test set accuracy:
Percent Type I errors: 0.04539877300613497
Percent Type II errors: 0.2049079754601227
Result: a ~1.7 percentage-point decrease in Type I errors on the test set (from about 6.3% to 4.5%), at the cost of a small increase in Type II errors.
Summary of changes: the max_depth, number of iterations, and loss function were changed. Additionally, train_test_split was used in place of the fixed 90/10 slice to get a better sampling distribution.