Hyper-parameters
A machine learning model is a mathematical formula with a number of parameters that are learned from the data. That is the crux of machine learning: fitting a model to the data.
However, there is another kind of parameter that cannot be learned directly from the regular training process. These parameters express “higher-level” properties of the model, such as its complexity or how fast it should learn. They are called hyperparameters, and they are usually fixed before the actual training process begins.
So, how are hyperparameters decided?
Broadly speaking, this is done by setting different values for those hyperparameters, training different models, and deciding which ones work best by testing them; a minimal sketch of this idea appears after the examples below.
So, to summarize, hyperparameters:
- cannot be learned directly from the data by the regular training process
- express higher-level properties of the model, such as its complexity or how fast it should learn
- are usually fixed before training begins
Some examples of hyperparameters:
- the number of trees in a random forest and their maximum depth
- the minimum number of samples required to split a node or to sit in a leaf
- the learning rate of models trained by gradient descent
source: Quora
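Before loading the real data, here is a minimal sketch of that idea: try a few candidate values for one hyperparameter, score each on held-out data, and keep the best. The synthetic dataset and the decision tree are used purely for illustration; the rest of this notebook does the same thing properly for a random forest with scikit-learn's cross-validation and search utilities.
In [ ]:
# Minimal sketch of hyperparameter selection (synthetic data for illustration):
# try candidate values, score each on a held-out split, keep the best.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

scores = {}
for depth in [2, 4, 8, None]:                      # candidate values of max_depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    scores[depth] = model.score(X_val, y_val)      # accuracy on the held-out split

print(scores)
print("best max_depth:", max(scores, key=scores.get))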
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
In [2]:
# Read the data
df = pd.read_csv("data/historical_loan.csv")
# Refine the data: fill missing values in the years column with the column mean
df.years = df.years.fillna(np.mean(df.years))
In [3]:
# Set up the features and target (the first column is the target)
X = df.iloc[:, 1:]
y = df.iloc[:, 0]
In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Basic checks
Check if the columns are the same in train and test.
What else will you check? [Discuss]
In [5]:
X_train.columns
Out[5]:
In [6]:
X_test.columns
Out[6]:
In [7]:
print(X_train.shape, X_test.shape)
In [9]:
print("train")
print(X_train.dtypes)
print()
print("test")
print(X_test.dtypes)
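These checks can also be done in a single cell. A sketch (extend it with whatever else came up in the discussion, e.g. missing values and class balance):
In [ ]:
# Quick sanity checks on the split (a sketch -- extend as needed)
assert list(X_train.columns) == list(X_test.columns)               # same columns, same order
assert X_train.dtypes.equals(X_test.dtypes)                        # same dtypes
print(X_train.isnull().sum().sum(), X_test.isnull().sum().sum())   # remaining missing values
print(y_train.value_counts(normalize=True))                        # class balance in train
print(y_test.value_counts(normalize=True))                         # class balance in test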
The categorical data should be encoded.
We saw LabelEncoder earlier; now we will use one-hot encoding.
In [11]:
X_train_updated = pd.get_dummies(X_train)
In [12]:
X_train.shape
Out[12]:
In [13]:
X_train_updated.shape
Out[13]:
In [15]:
#print the first record
X_train_updated.iloc[0]
Out[15]:
Exercise: Apply one-hot encoding to the test dataset and store the result in X_test_updated
In [ ]:
#Code here
In [16]:
X_test_updated = pd.get_dummies(X_test)
In [18]:
print(X_test.shape, X_test_updated.shape)
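Note that calling pd.get_dummies separately on train and test can produce different columns if a category happens to appear in only one of the two sets. One common fix (a sketch) is to align the encoded test frame to the training columns:
In [ ]:
# Align the one-hot encoded test set to the training columns:
# missing columns are added and filled with 0, extra columns are dropped.
X_test_updated = X_test_updated.reindex(columns=X_train_updated.columns, fill_value=0)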
In [20]:
# Print one record of the encoded test set
X_test_updated.iloc[1]
Out[20]:
In [23]:
print(X_train_updated.shape, y_train.shape)
In [24]:
# Let's build a random forest model
In [25]:
from sklearn.ensemble import RandomForestClassifier
In [43]:
model_rf = RandomForestClassifier(n_estimators=100,      # number of trees
                                  criterion="gini",       # split-quality measure
                                  max_depth=5,            # maximum depth of each tree
                                  min_samples_split=2,    # min samples to split an internal node
                                  min_samples_leaf=1,     # min samples required at a leaf
                                  oob_score=True,         # score on out-of-bag samples
                                  n_jobs=-1)              # use all available cores
In [44]:
model_rf.fit(X_train_updated, y_train)
Out[44]:
In [45]:
model_rf.oob_score_
Out[45]:
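The out-of-bag (OOB) score is computed on the samples each tree did not see in its bootstrap sample, so it gives a built-in estimate of generalization performance without a separate validation set.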
Let's do cross validation and see what the generalization error is
In [46]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc
In [47]:
model_rf = RandomForestClassifier(n_estimators=100,
                                  criterion="gini",
                                  max_depth=5,
                                  min_samples_split=2,
                                  min_samples_leaf=1,
                                  oob_score=True,
                                  n_jobs=-1)
In [48]:
%%time
#Or use %%timeit -n1 -r1 to time the cell
cross_val_score_rf = cross_val_score(model_rf,
                                     X_train_updated,
                                     y_train,
                                     scoring="roc_auc",
                                     cv=5,
                                     n_jobs=-1)
In [49]:
cross_val_score_rf
Out[49]:
Exercise
In [50]:
#What is the average cross validation score?
np.mean(cross_val_score_rf)
Out[50]:
The above was for one arbitrarily chosen set of hyperparameter values.
How do we run the model on various choices of hyperparameters?
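GridSearchCV evaluates every combination in a parameter grid with cross-validation and refits the best one on the full training data. For the grid used below (2 values of n_estimators × 4 values of max_depth) that is 8 combinations, i.e. 40 model fits with 5-fold CV. The combinations it will try can be listed explicitly:
In [ ]:
# Enumerate the combinations a grid search over this grid will try
from sklearn.model_selection import ParameterGrid
list(ParameterGrid({'n_estimators': [50, 100], 'max_depth': [3, 4, 5, 6]}))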
In [51]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
In [59]:
%%timeit -n1 -r1
# Set the parameters by cross-validation
tuned_parameters = [{'n_estimators': [50, 100],
                     'max_depth': [3, 4, 5, 6]}]

scores = ['roc_auc']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(RandomForestClassifier(n_jobs=-1),
                       tuned_parameters, cv=5,
                       scoring='%s' % score)
    clf.fit(X_train_updated, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test_updated)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_true, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print("AUC:", roc_auc)
    print(classification_report(y_true, y_pred))
    print()
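After fitting, the search object exposes the winning configuration and a model already refit on the full training data. (Note: %%timeit runs the cell in its own scope, so clf is not kept afterwards; run the search with %%time or without timing if you want to reuse it.) A sketch:
In [ ]:
# Inspect the winning configuration and reuse the refit model
# (assumes the search was run without %%timeit so clf is still defined).
print(clf.best_params_, clf.best_score_)
best_model = clf.best_estimator_              # already refit on the full training data
test_pred = best_model.predict(X_test_updated)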
Exercise
- For max_depth, include 6 and 10
- Add min_samples_split and min_samples_leaf to the grid search
- In addition to roc_auc, add precision and recall
In [ ]:
Challenges with grid search
Discuss
In [ ]:
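One challenge is that the number of fits in a grid search grows multiplicatively with every hyperparameter added. A randomized search instead samples a fixed number of settings (n_iter) from the given distributions, so its cost stays constant no matter how large the search space is. A rough illustration with a hypothetical discrete grid:
In [ ]:
# Hypothetical discrete grid: the fit count multiplies with every added hyperparameter,
# whereas the randomized search below tries only n_iter sampled settings.
from sklearn.model_selection import ParameterGrid

big_grid = {"n_estimators": [50, 100],
            "max_depth": [3, 4, 6, None],
            "max_features": list(range(1, 11)),
            "min_samples_split": [2, 5, 10],
            "min_samples_leaf": [1, 5, 10],
            "bootstrap": [True, False],
            "criterion": ["gini", "entropy"]}
print(len(ParameterGrid(big_grid)) * 5)   # total fits for an exhaustive 5-fold grid search
print(20 * 5)                             # fits for RandomizedSearchCV with n_iter=20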
In [56]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
In [60]:
%%timeit -n1 -r1
# Set the parameters by cross-validation
tuned_parameters = {"n_estimators": [50, 100],
                    "max_depth": [3, 4, 6, None],
                    "max_features": sp_randint(1, 11),
                    "min_samples_split": sp_randint(2, 11),
                    "min_samples_leaf": sp_randint(1, 11),
                    "bootstrap": [True, False],
                    "criterion": ["gini", "entropy"]}

scores = ['roc_auc']
n_iter_search = 20

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = RandomizedSearchCV(RandomForestClassifier(n_jobs=-1),
                             param_distributions=tuned_parameters,
                             n_iter=n_iter_search,
                             n_jobs=-1,
                             cv=5,
                             scoring='%s' % score)
    clf.fit(X_train_updated, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test_updated)
    #false_positive_rate, true_positive_rate, thresholds = roc_curve(y_true, y_pred)
    #roc_auc = auc(false_positive_rate, true_positive_rate)
    #print("AUC:", roc_auc)
    #print(classification_report(y_true, y_pred))
    #print()
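For ROC AUC on the test set it is better to score predicted probabilities rather than the hard class predictions used above, since hard labels throw away the ranking information the ROC curve is built from. A sketch (again assuming the search was run so that clf persists):
In [ ]:
# Test-set ROC AUC from predicted probabilities rather than hard labels
# (assumes clf from the randomized search is still defined, i.e. the cell
# was run without %%timeit).
proba = clf.predict_proba(X_test_updated)[:, 1]                    # probability of classes_[1]
fpr, tpr, _ = roc_curve(y_test, proba, pos_label=clf.classes_[1])
print("AUC:", auc(fpr, tpr))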
In [ ]: