Introduction to hyper-parameter tuning

Hyper-parameters

A machine learning model is a mathematical formula with a number of parameters that are learnt from the data. That is the crux of machine learning: fitting a model to the data.

However, there is another kind of parameter that cannot be learned directly from the regular training process. These parameters express “higher-level” properties of the model, such as its complexity or how fast it should learn. They are called hyperparameters. Hyperparameters are usually fixed before the actual training process begins.

So, how are hyperparameters decided?

Broadly speaking, this is done by setting different values for those hyperparameters, training a model with each setting, and testing the resulting models to decide which values work best.

To summarize, hyperparameters:

  • Define higher-level properties of the model, such as its complexity or capacity to learn.
  • Cannot be learned directly from the data in the standard model training process and need to be set beforehand.
  • Can be chosen by setting different values, training different models, and keeping the values that perform best on held-out data (a minimal sketch follows this list).
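
For example, a minimal sketch of that loop, assuming a scikit-learn classifier and some feature matrix X and target y (the notebook's actual search utilities are covered below):

# Minimal sketch of manual hyper-parameter search (illustrative only;
# assumes a feature matrix X and target y are already defined)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

best_depth, best_score = None, -1
for depth in [3, 5, 7]:                                   # candidate hyper-parameter values
    model = RandomForestClassifier(n_estimators=100, max_depth=depth)
    score = cross_val_score(model, X, y, cv=5).mean()     # evaluate each setting
    if score > best_score:
        best_depth, best_score = depth, score
print(best_depth, best_score)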

Some examples of hyperparameters:

  • Number of leaves or depth of a tree
  • Number of latent factors in a matrix factorization
  • Learning rate (in many models)
  • Number of hidden layers in a deep neural network
  • Number of clusters in k-means clustering

source: Quora


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

In [2]:
# Read the data
df = pd.read_csv("data/historical_loan.csv")

# Refine the data: fill missing values in years with the column mean
df.years = df.years.fillna(np.mean(df.years))

In [3]:
# Set up the features (all columns except the first) and the target (the first column)
X = df.iloc[:,1:]
y = df.iloc[:,0]

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Basic checks

Check if the columns are the same in train and test.

What else will you check? [Discuss]


In [5]:
X_train.columns


Out[5]:
Index(['amount', 'grade', 'years', 'ownership', 'income', 'age'], dtype='object')

In [6]:
X_test.columns


Out[6]:
Index(['amount', 'grade', 'years', 'ownership', 'income', 'age'], dtype='object')

In [7]:
print(X_train.shape, X_test.shape)


(6181, 6) (1546, 6)

In [9]:
print("train")
print(X_train.dtypes)
print()
print("test")
print(X_test.dtypes)


train
amount         int64
grade         object
years        float64
ownership     object
income       float64
age            int64
dtype: object

test
amount         int64
grade         object
years        float64
ownership     object
income       float64
age            int64
dtype: object
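
These eyeball checks can also be automated; a small sketch, assuming we only care that column names and dtypes match between the splits:

# Automated sanity checks: column names and dtypes should match across splits
assert list(X_train.columns) == list(X_test.columns), "column mismatch"
assert (X_train.dtypes == X_test.dtypes).all(), "dtype mismatch"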

The categorical columns (grade and ownership) need to be encoded before modeling.

We saw LabelEncoder earlier. Now, we will use one-hot encoding.

One-hot encoding


In [11]:
X_train_updated = pd.get_dummies(X_train)

In [12]:
X_train.shape


Out[12]:
(6181, 6)

In [13]:
X_train_updated.shape


Out[13]:
(6181, 15)

In [15]:
#print the first record
X_train_updated.iloc[0]


Out[15]:
amount                14500.0
years                    11.0
income                64000.0
age                      35.0
grade_A                   1.0
grade_B                   0.0
grade_C                   0.0
grade_D                   0.0
grade_E                   0.0
grade_F                   0.0
grade_G                   0.0
ownership_MORTGAGE        1.0
ownership_OTHER           0.0
ownership_OWN             0.0
ownership_RENT            0.0
Name: 303, dtype: float64

Exercise: Apply one-hot encoding to the test dataset and store the result in X_test_updated


In [ ]:
#Code here

In [16]:
X_test_updated = pd.get_dummies(X_test)

In [18]:
print(X_test.shape, X_test_updated.shape)


(1546, 6) (1546, 15)

In [20]:
#print the first record
X_test_updated.iloc[1]


Out[20]:
amount                 3000.0
years                     1.0
income                49800.0
age                      22.0
grade_A                   1.0
grade_B                   0.0
grade_C                   0.0
grade_D                   0.0
grade_E                   0.0
grade_F                   0.0
grade_G                   0.0
ownership_MORTGAGE        0.0
ownership_OTHER           0.0
ownership_OWN             0.0
ownership_RENT            1.0
Name: 2184, dtype: float64

In [23]:
print(X_train_updated.shape, y_train.shape)


(6181, 15) (6181,)
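
Note: pd.get_dummies encodes train and test independently, so a category that appears in only one split would leave the two frames with different columns. Here both end up with 15 columns, but a defensive sketch to align the test frame to the training columns could look like this:

# Align test columns to the training columns (missing dummies filled with 0,
# test-only dummies dropped) -- defensive step, a no-op for this data
X_test_updated = X_test_updated.reindex(columns=X_train_updated.columns, fill_value=0)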

In [24]:
# Let's build a random forest model

In [25]:
from sklearn.ensemble import RandomForestClassifier

In [43]:
model_rf = RandomForestClassifier(n_estimators=100,
                                 criterion="gini",
                                 max_depth=5,
                                 min_samples_split=2,
                                 min_samples_leaf= 1,
                                 oob_score=True,
                                 n_jobs=-1
                                 )

In [44]:
model_rf.fit(X_train_updated, y_train)


Out[44]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)

In [45]:
model_rf.oob_score_


Out[45]:
0.63873159682899205

Let's do cross-validation and estimate the generalization error.

Cross-validation


In [46]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc

In [47]:
model_rf = RandomForestClassifier(n_estimators=100,
                                 criterion="gini",
                                 max_depth=5,
                                 min_samples_split=2,
                                 min_samples_leaf= 1,
                                 oob_score=True,
                                 n_jobs=-1
                                 )

In [48]:
%%time

#Or use %%timeit -n1 -r1 to time the cell

cross_val_score_rf = cross_val_score(model_rf, 
                                     X_train_updated, 
                                     y_train, scoring="roc_auc",
                                     cv=5,
                                     n_jobs=-1
                                    )


CPU times: user 112 ms, sys: 64.7 ms, total: 176 ms
Wall time: 2.18 s

In [49]:
cross_val_score_rf


Out[49]:
array([ 0.6969647 ,  0.68786796,  0.69946444,  0.69435555,  0.67146693])

Exercise


In [50]:
#What is the average cross validation score?
np.mean(cross_val_score_rf)


Out[50]:
0.69002391398907892

The above was for one arbitrarily chosen set of hyper-parameter values.

How do we evaluate the model across various choices of hyper-parameters?


In [51]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

In [59]:
%%timeit -n1 -r1

# Set the parameters by cross-validation
tuned_parameters = [{'n_estimators': [50,100], 
                     'max_depth': [3, 4, 5, 6]
                    }]

scores = ['roc_auc']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(RandomForestClassifier(n_jobs=-1), 
                       tuned_parameters, cv=5,
                       scoring='%s' % score)
    clf.fit(X_train_updated, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test,  clf.predict(X_test_updated)
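    # Note: roc_curve below is applied to hard class predictions from predict();
    # clf.predict_proba(X_test_updated)[:, 1] would give a probability-based AUC instead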
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_true, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print("AUC:", roc_auc)
    
    print(classification_report(y_true, y_pred))
    print()


# Tuning hyper-parameters for roc_auc

Best parameters set found on development set:

{'max_depth': 6, 'n_estimators': 100}

Grid scores on development set:

0.684 (+/-0.022) for {'max_depth': 3, 'n_estimators': 50}
0.684 (+/-0.022) for {'max_depth': 3, 'n_estimators': 100}
0.687 (+/-0.018) for {'max_depth': 4, 'n_estimators': 50}
0.687 (+/-0.022) for {'max_depth': 4, 'n_estimators': 100}
0.687 (+/-0.016) for {'max_depth': 5, 'n_estimators': 50}
0.690 (+/-0.021) for {'max_depth': 5, 'n_estimators': 100}
0.691 (+/-0.022) for {'max_depth': 6, 'n_estimators': 50}
0.692 (+/-0.020) for {'max_depth': 6, 'n_estimators': 100}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

AUC: 0.630219677953
             precision    recall  f1-score   support

          0       0.63      0.71      0.67       807
          1       0.64      0.55      0.59       739

avg / total       0.63      0.63      0.63      1546


1 loop, best of 1: 22.9 s per loop

Exercise

  • For max_depth, include 6 and 10
  • Add min_samples_split and min_samples_leaf to the grid search
  • In addition to roc_auc, add precision and recall (a starter sketch follows the empty cell below)

In [ ]:
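
A starter sketch for this exercise (the parameter values are illustrative; precision and recall reuse the same loop as above by extending the scores list):

# Starter sketch for the exercise (illustrative values)
tuned_parameters = [{'n_estimators': [50, 100],
                     'max_depth': [3, 4, 5, 6, 10],
                     'min_samples_split': [2, 5, 10],
                     'min_samples_leaf': [1, 5, 10]
                    }]

scores = ['roc_auc', 'precision', 'recall']   # one grid search per metric, as above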

Challenges with grid search

Discuss
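
One concrete challenge: the number of model fits multiplies with every hyper-parameter added to the grid. A quick back-of-envelope calculation (illustrative grid sizes, 5-fold CV):

# Number of fits = product of grid sizes x number of CV folds (illustrative sizes)
grid_sizes = {'n_estimators': 2, 'max_depth': 5, 'min_samples_split': 3, 'min_samples_leaf': 3}
n_folds = 5
print(n_folds * np.prod(list(grid_sizes.values())))   # 450 fits for even a modest grid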


In [ ]:


In [56]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

In [60]:
%%timeit -n1 -r1

# Set the parameters by cross-validation
tuned_parameters = { "n_estimators": [50,100], 
                      "max_depth": [3, 4, 6, None],
                      "max_features": sp_randint(1, 11),
                      "min_samples_split": sp_randint(2, 11),
                      "min_samples_leaf": sp_randint(1, 11),
                      "bootstrap": [True, False],
                      "criterion": ["gini", "entropy"]
                    }

scores = ['roc_auc']


n_iter_search = 20

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = RandomizedSearchCV(RandomForestClassifier(n_jobs=-1),
                             param_distributions=tuned_parameters,
                             n_iter=n_iter_search,
                             n_jobs=-1,
                             cv=5,
                             scoring='%s' % score)
    clf.fit(X_train_updated, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test,  clf.predict(X_test_updated)
    
    #false_positive_rate, true_positive_rate, thresholds = roc_curve(y_true, y_pred)
    #roc_auc = auc(false_positive_rate, true_positive_rate)
    #print("AUC:", roc_auc)
    
    #print(classification_report(y_true, y_pred))
    #print()


# Tuning hyper-parameters for roc_auc

Best parameters set found on development set:

{'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 3, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 50}

Grid scores on development set:

0.702 (+/-0.024) for {'bootstrap': True, 'criterion': 'gini', 'max_depth': None, 'max_features': 7, 'min_samples_leaf': 5, 'min_samples_split': 4, 'n_estimators': 100}
0.686 (+/-0.020) for {'bootstrap': True, 'criterion': 'gini', 'max_depth': 4, 'max_features': 5, 'min_samples_leaf': 7, 'min_samples_split': 7, 'n_estimators': 50}
0.687 (+/-0.016) for {'bootstrap': True, 'criterion': 'gini', 'max_depth': 4, 'max_features': 8, 'min_samples_leaf': 9, 'min_samples_split': 2, 'n_estimators': 100}
0.685 (+/-0.018) for {'bootstrap': False, 'criterion': 'gini', 'max_depth': 4, 'max_features': 8, 'min_samples_leaf': 7, 'min_samples_split': 2, 'n_estimators': 100}
0.690 (+/-0.019) for {'bootstrap': True, 'criterion': 'entropy', 'max_depth': 6, 'max_features': 4, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 50}
0.685 (+/-0.019) for {'bootstrap': True, 'criterion': 'entropy', 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 50}
0.703 (+/-0.024) for {'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'max_features': 1, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 100}
0.682 (+/-0.023) for {'bootstrap': False, 'criterion': 'entropy', 'max_depth': 3, 'max_features': 10, 'min_samples_leaf': 3, 'min_samples_split': 10, 'n_estimators': 100}
0.697 (+/-0.023) for {'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'max_features': 9, 'min_samples_leaf': 5, 'min_samples_split': 2, 'n_estimators': 50}
0.684 (+/-0.018) for {'bootstrap': False, 'criterion': 'gini', 'max_depth': 3, 'max_features': 7, 'min_samples_leaf': 8, 'min_samples_split': 4, 'n_estimators': 100}
0.686 (+/-0.019) for {'bootstrap': False, 'criterion': 'entropy', 'max_depth': 4, 'max_features': 5, 'min_samples_leaf': 7, 'min_samples_split': 7, 'n_estimators': 50}
0.692 (+/-0.019) for {'bootstrap': True, 'criterion': 'gini', 'max_depth': 6, 'max_features': 8, 'min_samples_leaf': 9, 'min_samples_split': 3, 'n_estimators': 100}
0.693 (+/-0.020) for {'bootstrap': True, 'criterion': 'entropy', 'max_depth': 6, 'max_features': 4, 'min_samples_leaf': 7, 'min_samples_split': 8, 'n_estimators': 100}
0.692 (+/-0.018) for {'bootstrap': True, 'criterion': 'gini', 'max_depth': 6, 'max_features': 5, 'min_samples_leaf': 7, 'min_samples_split': 5, 'n_estimators': 100}
0.703 (+/-0.021) for {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 2, 'min_samples_leaf': 6, 'min_samples_split': 6, 'n_estimators': 50}
0.692 (+/-0.019) for {'bootstrap': True, 'criterion': 'gini', 'max_depth': 6, 'max_features': 8, 'min_samples_leaf': 3, 'min_samples_split': 10, 'n_estimators': 100}
0.689 (+/-0.020) for {'bootstrap': False, 'criterion': 'entropy', 'max_depth': 6, 'max_features': 6, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 50}
0.701 (+/-0.022) for {'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'max_features': 4, 'min_samples_leaf': 8, 'min_samples_split': 2, 'n_estimators': 100}
0.704 (+/-0.026) for {'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 3, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 50}
0.687 (+/-0.020) for {'bootstrap': True, 'criterion': 'entropy', 'max_depth': 4, 'max_features': 2, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

1 loop, best of 1: 30.7 s per loop

In [ ]: