Train software matching algorithm

This notebook follows on from the previous notebook, "Train vendor matching algorithm".

The training proceeds in a similar manner:

  • First, the classifier's hyperparameters are tuned on typical data using a randomized search.
  • Next, the ML classifier is fitted to the training data using the optimum parameters.
  • Finally, the trained model is stored for future use.

Read in the software training data

Read in the manually labelled software training data.

Format it and convert to two numpy arrays for input to the scikit-learn ML algorithm.


In [1]:
# Initialize

import sys  # needed by the catch-all error handler below

import pandas as pd
import numpy as np

try:
    df_label_software = pd.read_csv(
                            "/home/jovyan/work/shared/data/csv/label_software.csv",
                            # pandas >= 1.3: replaces error_bad_lines=False / warn_bad_lines=True
                            on_bad_lines='warn',
                            quotechar='"',
                            encoding='utf-8')
except IOError as e:
    print('\n\n***I/O error({0}): {1}\n\n'.format(
                e.errno, e.strerror))

# except ValueError:
#    self.logger.critical('Could not convert data to an integer.')
except:
    print(
        '\n\n***Unexpected error: {0}\n\n'.format(
            sys.exc_info()[0]))
    raise

# Number of records / columns

df_label_software.shape


Out[1]:
(22262, 18)

In [2]:
# Print out some sample values

df_label_software.sample(5)


Out[2]:
      | Unnamed: 0 | vendor_X | software_X | title_X | DisplayName0 | release_X | Version0 | fz_ratio | fz_ptl_ratio | fz_tok_set_ratio | fz_ptl_tok_sort_ratio | fz_uwratio | fz_rel_ratio | fz_rel_ptl_ratio | t_cve_name | titlX_len | DsplyNm0_len | match
16816 | 17410 | google | chrome | Google Chrome 18.0.1025.136 | 谷歌拼音输入法 3.0 | 18.0.1025.136 | - | 18 | 27 | 100 | 26 | 85 | 0 | 0 | cpe:/a:google:chrome:18.0.1025.136 | 27 | 11 | 0
178 | 178 | adobe | acrobat | Adobe Acrobat 3.1 | Adobe Acrobat 8.3.1 - CPSID_83708 | 3.1 | - | 62 | 84 | 100 | 61 | 85 | 0 | 0 | cpe:/a:adobe:acrobat:3.1 | 17 | 33 | 0
3286 | 3547 | adobe | acrobat_reader | Adobe Acrobat Reader 8.2.2 | Adobe Acrobat 9.5.0 - CPSID_83708 | 8.2.2 | - | 47 | 54 | 100 | 53 | 53 | 0 | 0 | cpe:/a:adobe:acrobat_reader:8.2.2 | 26 | 33 | 0
17726 | 18320 | google | chrome | Google Chrome 30.0.1599.30 | 谷歌拼音输入法 3.0 | 30.0.1599.30 | - | 25 | 36 | 100 | 27 | 85 | 0 | 0 | cpe:/a:google:chrome:30.0.1599.30 | 26 | 11 | 0
11155 | 11677 | adobe | flash_player | Adobe Flash Player 8.0.34.0 | Neolane v6.0 6.1.0 Build 8113 | 8.0.34.0 | - | 35 | 39 | 100 | 48 | 45 | 0 | 0 | cpe:/a:adobe:flash_player:8.0.34.0 | 27 | 29 | 0

In [3]:
# Check that all rows are labelled

# (Should return "False")

df_label_software['match'].isnull().any()


Out[3]:
False

In [4]:
# Format training data as "X" == features, "y" == target.
# The target value ('match') is the 1st column.
df_match_train1 = df_label_software[['match', 'fz_ratio', 'fz_ptl_ratio', 'fz_tok_set_ratio',
        'fz_ptl_tok_sort_ratio', 'fz_uwratio', 'fz_rel_ratio',
        'fz_rel_ptl_ratio', 'titlX_len', 'DsplyNm0_len']]

# Convert into 2 numpy arrays for the scikit-learn ML classification algorithms.
np_match_train1 = np.asarray(df_match_train1)
X, y = np_match_train1[:, 1:], np_match_train1[:, 0]

print(X.shape, y.shape)


((22262, 9), (22262,))

Use a grid search to tune the ML algorithm

As before, the classification algorithm needs to be tuned for optimal performance with the data.

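This is done using a randomized search (RandomizedSearchCV), which evaluates a fixed number of randomly sampled parameter settings instead of every combination in a grid. The code is adapted from the scikit-learn sample code.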
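To make the sampling concrete, here is a minimal sketch of how candidate settings are drawn from distributions like the ones used in this notebook. It uses scikit-learn's ParameterSampler, which is what RandomizedSearchCV relies on internally. The reduced search space, the n_iter value, and the random_state are illustrative assumptions, not the notebook's actual configuration.

```python
from scipy.stats import randint as sp_randint
from sklearn.model_selection import ParameterSampler

# Illustrative (assumed) search space, reduced to three parameters.
param_dist_demo = {
    "n_estimators": sp_randint(20, 100),  # integers 20..99 (upper bound is excluded)
    "max_features": sp_randint(1, 7),     # integers 1..6
    "criterion": ["gini", "entropy"],     # lists are sampled uniformly
}

# Draw 5 candidate settings; random_state makes the draw repeatable.
samples = list(ParameterSampler(param_dist_demo, n_iter=5, random_state=0))
for params in samples:
    print(params)
```

Note that scipy's randint excludes its upper bound, so sp_randint(1, 7) never proposes max_features=7.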


In [5]:
# Now find optimum parameters for the model using a randomized search

from time import time
from scipy.stats import randint as sp_randint

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# build a classifier
clf = RandomForestClassifier()

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
            

# specify parameters and distributions to sample from
# (sp_randint(low, high) samples integers in [low, high - 1])
param_dist = {"n_estimators": sp_randint(20, 100),
              "max_depth": [3, None],
              "max_features": sp_randint(1,7),
              "min_samples_split": sp_randint(2,7),
              "min_samples_leaf": sp_randint(1, 7),
              "bootstrap": [True, False],
              # 'auto' was removed from scikit-learn; 'balanced' is its replacement
              "class_weight": ['balanced', None],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 40
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)


RandomizedSearchCV took 73.01 seconds for 40 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.976 (std: 0.007)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'n_estimators': 50, 'min_samples_split': 5, 'criterion': 'gini', 'max_features': 4, 'max_depth': 3, 'class_weight': None}

Model with rank: 2
Mean validation score: 0.976 (std: 0.009)
Parameters: {'bootstrap': False, 'min_samples_leaf': 3, 'n_estimators': 56, 'min_samples_split': 4, 'criterion': 'entropy', 'max_features': 2, 'max_depth': None, 'class_weight': None}

Model with rank: 3
Mean validation score: 0.976 (std: 0.008)
Parameters: {'bootstrap': True, 'min_samples_leaf': 5, 'n_estimators': 92, 'min_samples_split': 6, 'criterion': 'gini', 'max_features': 2, 'max_depth': None, 'class_weight': None}
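The top-ranked setting can also be read directly from the fitted search object rather than copied from the printed report. The following is a minimal sketch on synthetic data: X_demo, y_demo, the reduced search space, and the cv, n_iter and random_state values are assumptions chosen to keep the example small, not the notebook's real inputs.

```python
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in (assumed) for the notebook's X, y: 9 features, binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": sp_randint(20, 40),
        "max_depth": [3, None],
        "min_samples_leaf": sp_randint(1, 7),
    },
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# Parameters of the rank-1 candidate, and a model already refitted on all the data
# (refit=True is the default).
best_params = search.best_params_
best_clf = search.best_estimator_
print(best_params)
print(round(search.best_score_, 3))
```

Because the top three candidates above score within one standard deviation of each other, picking parameters by hand, as done in the next section, is also a reasonable choice.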

Run the ML classifier with optimum parameters on the training data

Based on the above, and ignoring default values, the optimum set of parameters would be something like the following:

'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 55, 'min_samples_split': 5, 'criterion': 'gini', 'max_features': 4, 'max_depth': 3, 'class_weight': None

The RandomForest classifier is now trained on the training data to produce the model.


In [9]:
clf = RandomForestClassifier(
    bootstrap=True,
    min_samples_leaf=3,
    n_estimators=55,
    min_samples_split=5,
    criterion='gini',
    max_features=4,
    max_depth=3,
    class_weight=None
)

# Train model on original training data
clf.fit(X, y)

# save model for future use

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(clf, '/home/jovyan/work/shared/data/models/software_classif_trained_Rdm_Forest.pkl.z')


Out[9]:
['/home/jovyan/work/shared/data/models/software_classif_trained_Rdm_Forest.pkl.z']

In [10]:
# Test loading

clf = joblib.load('/home/jovyan/work/shared/data/models/software_classif_trained_Rdm_Forest.pkl.z')
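Loading without an error does not by itself show that the stored model behaves like the original. A minimal sketch of a stricter round-trip check follows, assuming the standalone joblib package and using synthetic data and a temporary file (X_demo, y_demo, clf_demo and demo_forest.pkl.z are hypothetical names) so that the real model file is not touched.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in (assumed) for the notebook's training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
clf_demo.fit(X_demo, y_demo)

# Save to a temporary location and load it back; the ".z" extension
# makes joblib compress the file with zlib.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_forest.pkl.z")
    joblib.dump(clf_demo, model_path)
    clf_loaded = joblib.load(model_path)

# The reloaded model should give the same predictions as the original.
same_predictions = np.array_equal(clf_demo.predict(X_demo), clf_loaded.predict(X_demo))
print(same_predictions)
```

The same comparison could be run in this notebook by keeping a reference to the fitted clf before reloading and comparing its predictions on X with those of the reloaded model.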