Blitz Classifiers (v0.01)

Here we present Blitz Classifiers with scikit-learn.

The main idea is to use a simple procedure to choose the best algorithm that fits your data.

Note that the main purpose of Blitz Classifiers is to simplify the choice of an initial algorithm; after that, you, as a Machine Learning Engineer, can choose the best algorithm to solve your problem, considering complexity, scalability and knowledge.
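
The core loop behind this idea is short: fit several out-of-the-box classifiers on the same split and rank them by a single error metric. Here is a minimal sketch of that loop (the helper function blitz and the two placeholder classifiers are illustrative only, not part of this notebook's code):

In [ ]:
# Illustrative sketch of the Blitz Classifiers idea (not part of the workflow below):
# fit several out-of-the-box classifiers and rank them by one error metric.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error

def blitz(classifiers, X_train, Y_train, X_test, Y_test):
    scores = {}
    for name, clf in classifiers.items():
        clf.fit(X_train, Y_train)
        scores[name] = mean_squared_error(Y_test, clf.predict(X_test))
    # Lowest error first
    return sorted(scores.items(), key=lambda kv: kv[1])

# Example usage with two placeholder classifiers:
# blitz({'logistic': LogisticRegression(), 'tree': DecisionTreeClassifier()},
#       X_train, Y_train, X_test, Y_test)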

First of all, let's import some useful libraries.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_squared_error


Now we'll import the following classifiers from scikit-learn:

  • Random Forest
  • Gradient Boosting
  • Extra Trees
  • AdaBoost
  • SVC
  • KNeighbors
  • Decision Tree
  • Perceptron
  • Logistic Regression

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import LogisticRegression

Now we'll load a structured dataset in which all columns are numeric.


In [3]:
credit = pd.read_csv('https://raw.githubusercontent.com/fclesio/learning-space/master/Datasets/02%20-%20Classification/default_credit_card.csv')

Let's take a look at our dataset.


In [4]:
credit.head()


Out[4]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 DEFAULT
0 1 20000 2 2 1 24 2 2 -1 -1 ... 0 0 0 0 689 0 0 0 0 1
1 2 120000 2 2 2 26 -1 2 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
2 3 90000 2 2 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 4 50000 2 2 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 5 50000 1 2 1 57 -1 0 -1 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 25 columns

As we can see, we have only numerical attributes. Below, let's look at the correlations with our dependent variable (DEFAULT).


In [5]:
credit.corr()["DEFAULT"]


Out[5]:
ID          -0.013952
LIMIT_BAL   -0.153520
SEX         -0.039961
EDUCATION    0.028006
MARRIAGE    -0.024339
AGE          0.013890
PAY_0        0.324794
PAY_2        0.263551
PAY_3        0.235253
PAY_4        0.216614
PAY_5        0.204149
PAY_6        0.186866
BILL_AMT1   -0.019644
BILL_AMT2   -0.014193
BILL_AMT3   -0.014076
BILL_AMT4   -0.010156
BILL_AMT5   -0.006760
BILL_AMT6   -0.005372
PAY_AMT1    -0.072929
PAY_AMT2    -0.058579
PAY_AMT3    -0.056250
PAY_AMT4    -0.056827
PAY_AMT5    -0.055124
PAY_AMT6    -0.053183
DEFAULT      1.000000
Name: DEFAULT, dtype: float64

In this part of the code, we'll select the features of our dataset and then split it into training and test sets.


In [6]:
features = credit.columns[1:24]
target = credit.columns[24:25]
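
If you want to double-check the slicing (columns[1:24] skips ID and keeps the 23 predictor columns; columns[24:25] is DEFAULT), an optional sanity check, not part of the original flow, could be:

In [ ]:
# Optional check: list the selected predictor columns and the target column
print(list(features))
print(list(target))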

In [7]:
# X_train: independent (predictor) variables for the training set
# Y_train: dependent (target) variable for the training set

# X_test: independent (predictor) variables for the test set
# Y_test: dependent (target) variable for the test set

X_train, X_test, Y_train, Y_test = train_test_split(
    credit[features].values, credit['DEFAULT'].values, test_size=0.2, random_state=0)

Let's see the shape of our datasets.


In [8]:
print (X_train.shape)
print (X_test.shape)
print (Y_train.shape)
print (Y_test.shape)


(24000, 23)
(6000, 23)
(24000,)
(6000,)

Now we'll instantiate an object for each classifier.


In [9]:
rfc = RandomForestClassifier(n_estimators=100, min_samples_leaf=10, random_state=1, n_jobs=2)
gbc = GradientBoostingClassifier()
etc = ExtraTreesClassifier()
abc = AdaBoostClassifier()
svc = SVC()
knc = KNeighborsClassifier()
dtc = DecisionTreeClassifier()
ptc = Perceptron()
lrc = LogisticRegression()

Using the training set, we'll fit a model for each classifier.


In [10]:
rfc.fit(X_train, Y_train)
gbc.fit(X_train, Y_train)
etc.fit(X_train, Y_train)
abc.fit(X_train, Y_train)
svc.fit(X_train, Y_train)
knc.fit(X_train, Y_train)
dtc.fit(X_train, Y_train)
ptc.fit(X_train, Y_train)
lrc.fit(X_train, Y_train)


Out[10]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We'll build an object called expected holding the target variable of the training set. We'll use it to check how well each model fits the training data and where the errors are.


In [11]:
expected = Y_train

Now we'll call the predict method on our training attributes to build a prediction object for each classifier.


In [12]:
predicted_rfc = rfc.predict(X_train)
predicted_gbc = gbc.predict(X_train)
predicted_etc = etc.predict(X_train)
predicted_abc = abc.predict(X_train)
predicted_svc = svc.predict(X_train)
predicted_knc = knc.predict(X_train)
predicted_dtc = dtc.predict(X_train)
predicted_ptc = ptc.predict(X_train)
predicted_lrc = lrc.predict(X_train)

If you want to see every classification report, feel free to execute the code below (this step will be deprecated in the next version).


In [13]:
print(metrics.classification_report(expected, predicted_rfc))
print(metrics.classification_report(expected, predicted_gbc))
print(metrics.classification_report(expected, predicted_etc))
print(metrics.classification_report(expected, predicted_abc))
print(metrics.classification_report(expected, predicted_svc))
print(metrics.classification_report(expected, predicted_knc))
print(metrics.classification_report(expected, predicted_dtc))
print(metrics.classification_report(expected, predicted_ptc))
print(metrics.classification_report(expected, predicted_lrc))


             precision    recall  f1-score   support

          0       0.86      0.97      0.91     18661
          1       0.80      0.44      0.57      5339

avg / total       0.84      0.85      0.83     24000

             precision    recall  f1-score   support

          0       0.84      0.95      0.89     18661
          1       0.70      0.38      0.49      5339

avg / total       0.81      0.83      0.80     24000

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     18661
          1       1.00      1.00      1.00      5339

avg / total       1.00      1.00      1.00     24000

             precision    recall  f1-score   support

          0       0.83      0.96      0.89     18661
          1       0.68      0.32      0.43      5339

avg / total       0.80      0.82      0.79     24000

             precision    recall  f1-score   support

          0       0.99      1.00      1.00     18661
          1       1.00      0.98      0.99      5339

avg / total       0.99      0.99      0.99     24000

             precision    recall  f1-score   support

          0       0.83      0.95      0.89     18661
          1       0.67      0.34      0.45      5339

avg / total       0.80      0.82      0.79     24000

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     18661
          1       1.00      1.00      1.00      5339

avg / total       1.00      1.00      1.00     24000

             precision    recall  f1-score   support

          0       0.82      0.44      0.57     18661
          1       0.25      0.67      0.37      5339

avg / total       0.69      0.49      0.52     24000

             precision    recall  f1-score   support

          0       0.78      1.00      0.87     18661
          1       0.00      0.00      0.00      5339

avg / total       0.60      0.78      0.68     24000

The same applies to the confusion matrix of each classifier.


In [14]:
print(metrics.confusion_matrix(expected, predicted_rfc))
print(metrics.confusion_matrix(expected, predicted_gbc))
print(metrics.confusion_matrix(expected, predicted_etc))
print(metrics.confusion_matrix(expected, predicted_abc))
print(metrics.confusion_matrix(expected, predicted_svc))
print(metrics.confusion_matrix(expected, predicted_knc))
print(metrics.confusion_matrix(expected, predicted_dtc))
print(metrics.confusion_matrix(expected, predicted_ptc))
print(metrics.confusion_matrix(expected, predicted_lrc))


[[18083   578]
 [ 3006  2333]]
[[17776   885]
 [ 3314  2025]]
[[18660     1]
 [    8  5331]]
[[17884   777]
 [ 3656  1683]]
[[18642    19]
 [  125  5214]]
[[17785   876]
 [ 3542  1797]]
[[18660     1]
 [    8  5331]]
[[ 8136 10525]
 [ 1777  3562]]
[[18659     2]
 [ 5339     0]]

Now we'll predict on our test dataset to see how well our models generalize. Note that the perfect training-set scores above (Extra Trees and Decision Trees) mostly reflect memorization of the training data, so the test set is what matters.


In [15]:
predictions_rfc = rfc.predict(X_test)
predictions_gbc = gbc.predict(X_test)
predictions_etc = etc.predict(X_test)
predictions_abc = abc.predict(X_test)
predictions_svc = svc.predict(X_test)
predictions_knc = knc.predict(X_test)
predictions_dtc = dtc.predict(X_test)
predictions_ptc = ptc.predict(X_test)
predictions_lrc = lrc.predict(X_test)

Let's store the Mean Squared Error (MSE) for each classifier.


In [16]:
mse_rfc = mean_squared_error(predictions_rfc, Y_test)
mse_abc = mean_squared_error(predictions_abc, Y_test)
mse_etc = mean_squared_error(predictions_etc, Y_test)
mse_gbc = mean_squared_error(predictions_gbc, Y_test)
mse_svc = mean_squared_error(predictions_svc, Y_test)
mse_knc = mean_squared_error(predictions_knc, Y_test)
mse_dtc = mean_squared_error(predictions_dtc, Y_test)
mse_ptc = mean_squared_error(predictions_ptc, Y_test)
mse_lrc = mean_squared_error(predictions_lrc, Y_test)

Now the scores:


In [17]:
print('MSE - Random Forests:', round(mse_rfc,3))
print('MSE - Gradient Boosting:', round(mse_gbc,3))
print('MSE - Extra Trees:', round(mse_etc,3))
print('MSE - Ada Boosting:', round(mse_abc,3))
print('MSE - SVM:', round(mse_svc,3))
print('MSE - KNN:', round(mse_knc,3))
print('MSE - Decision Trees:', round(mse_dtc,3))
print('MSE - Perceptron:', round(mse_ptc,3))
print('MSE - Logistic Regression:', round(mse_lrc,3))


MSE - Random Forests: 0.172
MSE - Gradient Boosting: 0.172
MSE - Extra Trees: 0.195
MSE - Ada Boosting: 0.174
MSE - SVM: 0.216
MSE - KNN: 0.238
MSE - Decision Trees: 0.263
MSE - Perceptron: 0.518
MSE - Logistic Regression: 0.216
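
Since both DEFAULT and the predictions are 0/1, each MSE above is simply the fraction of misclassified test examples, i.e. 1 - accuracy. If you want to verify that, a quick optional check (accuracy_score is not used elsewhere in this notebook) would be:

In [ ]:
from sklearn.metrics import accuracy_score

# For 0/1 labels, MSE equals the misclassification rate, i.e. 1 - accuracy;
# this should match the Gradient Boosting MSE printed above.
print(round(1 - accuracy_score(Y_test, predictions_gbc), 3))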

OK, let's rank our algorithms to find the best one to start our analysis with.


In [18]:
algorithms = {'Algorithm': ['Random Forests', 'Gradient Boosting', 'Extra Trees', 'Ada Boosting', 'SVM', 'KNN', 'Decision Trees', 'Perceptron', 'Logistic Regression'],
        'MSE': [round(mse_rfc,4), round(mse_gbc,4), round(mse_etc,4), round(mse_abc,4), round(mse_svc,4), round(mse_knc,4), round(mse_dtc,4), round(mse_ptc,4), round(mse_lrc,4)]}

# Convert to a pandas DataFrame so we can sort it
algos = pd.DataFrame(algorithms)

algos.sort_values(by='MSE', ascending=True)


Out[18]:
Algorithm MSE
1 Gradient Boosting 0.1720
0 Random Forests 0.1723
3 Ada Boosting 0.1740
2 Extra Trees 0.1947
4 SVM 0.2158
8 Logistic Regression 0.2160
5 KNN 0.2377
6 Decision Trees 0.2632
7 Perceptron 0.5178

As we can see, the Gradient Boosting algorithm shows the best performance with default parameters for this dataset. We can start our analysis and development based on this algorithm.
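
As a possible next step (not shown in this notebook), you could double-check the ranking with cross-validation instead of a single train/test split. A sketch using cross_val_score, assuming the objects defined above, might look like this:

In [ ]:
from sklearn.model_selection import cross_val_score

# Illustrative follow-up: 5-fold cross-validated accuracy for the best-ranked model
cv_scores = cross_val_score(GradientBoostingClassifier(), credit[features].values,
                            credit['DEFAULT'].values, cv=5)
print(cv_scores.mean(), cv_scores.std())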

There's a lot of work to do, but this is just the beginning. Thanks for reading.