Exercise 1 Task 2

Examination of the runtime improvements of ensemble classifiers on a dataset of 250k elements as a function of the number of available cores

This notebook should be run on an 8-core server to provide stable results.
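Before running the timing cells, it can help to confirm how many cores the machine reports. The following is a minimal sketch using only the standard library; the name `required_cores` is an assumption introduced for illustration, and `os.cpu_count()` reports logical cores, which may differ from the number actually available to the process on a shared server.

```python
import os

required_cores = 8  # hypothetical name; the notebook assumes an 8-core server

# os.cpu_count() may return None if the count cannot be determined
reported_cores = os.cpu_count() or 1

if reported_cores < required_cores:
    print("Warning: only %d core(s) reported, timings for higher "
          "job counts will not be representative" % reported_cores)
else:
    print("%d core(s) reported" % reported_cores)
```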


In [1]:
# Load necessary libraries (pandas import changed for convenience)
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [2]:
# creation of a dataset consisting of 250k samples
# with the following parameters
samples = 250*1000
features = 40
informative = 5
redundant = 4
X, Y = make_classification(n_samples=samples,
                           n_features=features,
                           n_informative=informative,
                           n_redundant=redundant)

In [21]:
# Split-out validation dataset
validation_size = 0.20
seed = 7
scoring = 'accuracy'
X_train, X_validation, Y_train, Y_validation = train_test_split(X,
                                                                Y,
                                                                test_size=validation_size,
                                                                random_state=seed)

Using 8 estimators (one per core when all 8 cores (jobs) are used)

For each number of jobs from 1 to 8 (inclusive), one RandomForestClassifier (RFC) is instantiated and trained on the training set of 200k elements. The training time is measured with the %timeit magic and the resulting timing objects are stored in a list.
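The %timeit magic is only available inside IPython. As a minimal sketch of the same "best of 3" measurement in plain Python, the standard-library `timeit.repeat` can be used. The helper name `best_of` and the cheap stand-in workload are assumptions for illustration; in the notebook the timed call would be `rf_class.fit(X_train, Y_train)`.

```python
import timeit


def best_of(func, repeat=3, number=1):
    """Return the smallest per-call time in seconds over `repeat` runs."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


# Stand-in workload (assumption); replace with the classifier's fit call
best = best_of(lambda: sum(range(100000)))
print("best of 3: %.6f s per loop" % best)
```

Taking the minimum rather than the mean follows what the `best` attribute of the %timeit result reports, since the fastest run is the one least disturbed by other load on the machine.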


In [4]:
# Train one Random Forest Classifier per job count and time each fit
estimators = 8  # One estimator per core when all 8 cores are used
jobs = 8
time_it_results = []
for n_jobs in range(1, jobs + 1):
    rf_class = RandomForestClassifier(n_estimators=estimators, n_jobs=n_jobs)
    tr = %timeit -o rf_class.fit(X_train, Y_train)
    time_it_results.append(tr)


1 loop, best of 3: 19.8 s per loop
1 loop, best of 3: 10.5 s per loop
1 loop, best of 3: 10.7 s per loop
1 loop, best of 3: 5.83 s per loop
1 loop, best of 3: 5.33 s per loop
1 loop, best of 3: 5.69 s per loop
1 loop, best of 3: 5.29 s per loop
1 loop, best of 3: 4.49 s per loop
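
The timings above can also be expressed as speedup and parallel efficiency relative to the single-job run. This is a small sketch, not part of the original notebook: the values are copied by hand from the output above, and the names `measured_times`, `speedups` and `efficiencies` are introduced only for illustration.

```python
# Best-of-3 training times in seconds, copied from the output above
measured_times = [19.8, 10.5, 10.7, 5.83, 5.33, 5.69, 5.29, 4.49]

baseline = measured_times[0]  # single-job training time

# Speedup: how many times faster than the single-job run
speedups = [baseline / t for t in measured_times]

# Efficiency: speedup divided by the number of jobs used
efficiencies = [s / n for n, s in enumerate(speedups, start=1)]

for n, (s, e) in enumerate(zip(speedups, efficiencies), start=1):
    print("%d job(s): speedup %.2fx, efficiency %.2f" % (n, s, e))
```

With these numbers, 8 jobs give a speedup of roughly 4.4x over a single job, which corresponds to an efficiency of about 0.55.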

In [5]:
# Extract the best time of each timing run
best_times = [timer.best for timer in time_it_results]

Plot of the training time in seconds of each RFC against the number of cores used (number of jobs)


In [22]:
x = np.arange(1,9)
labels = ['%i. Core' % i for i in x]
fig = plt.figure()
fig.suptitle('Training Time per number of cores')
ax = fig.add_subplot(111)
ax.set_xlabel('Number of cores')
ax.set_ylabel('Training time (s)')
ax.plot(x, best_times)
plt.xticks(x, labels, rotation='vertical')
plt.show()


