Train software matching algorithm

This is the follow-on from the previous notebook "Train vendor matching algorithm".

The training proceeds in a similar manner:

First, the algorithm's parameters are tuned for typical data using a randomized search over a parameter grid.
Next, the ML classifier is run on the training data using the optimum parameters.
Finally, the trained model is stored for future use.

Read in the software training data

Read in the manually labelled software training data.

Format it and convert to two numpy arrays for input to the scikit-learn ML algorithm.



In [1]:

    
# Initialize

import sys

import pandas as pd
import numpy as np

try:
    df_label_software = pd.read_csv(
        "/home/jovyan/work/shared/data/csv/label_software.csv",
        # Skip malformed rows with a warning (pandas >= 1.3; older
        # versions used error_bad_lines=False, warn_bad_lines=True).
        on_bad_lines='warn',
        quotechar='"',
        encoding='utf-8')
except IOError as e:
    print('\n\n***I/O error({0}): {1}\n\n'.format(
                e.errno, e.strerror))
    # Re-raise: the cells below cannot run without the DataFrame.
    raise

# except ValueError:
#    self.logger.critical('Could not convert data to an integer.')
except Exception:
    print(
        '\n\n***Unexpected error: {0}\n\n'.format(
            sys.exc_info()[0]))
    raise

# Number of records / columns

df_label_software.shape









    Out[1]:





(22262, 18)



In [2]:

    
# Print out some sample values

df_label_software.sample(5)









    Out[2]:






  
    
      
	Unnamed: 0	vendor_X	software_X	title_X	DisplayName0	release_X	Version0	fz_ratio	fz_ptl_ratio	fz_tok_set_ratio	fz_ptl_tok_sort_ratio	fz_uwratio	fz_rel_ratio	fz_rel_ptl_ratio	t_cve_name	titlX_len	DsplyNm0_len	match
16816	17410	google	chrome	Google Chrome 18.0.1025.136	谷歌拼音输入法 3.0	18.0.1025.136	-	18	27	100	26	85	0	0	cpe:/a:google:chrome:18.0.1025.136	27	11	0
178	178	adobe	acrobat	Adobe Acrobat 3.1	Adobe Acrobat 8.3.1 - CPSID_83708	3.1	-	62	84	100	61	85	0	0	cpe:/a:adobe:acrobat:3.1	17	33	0
3286	3547	adobe	acrobat_reader	Adobe Acrobat Reader 8.2.2	Adobe Acrobat 9.5.0 - CPSID_83708	8.2.2	-	47	54	100	53	53	0	0	cpe:/a:adobe:acrobat_reader:8.2.2	26	33	0
17726	18320	google	chrome	Google Chrome 30.0.1599.30	谷歌拼音输入法 3.0	30.0.1599.30	-	25	36	100	27	85	0	0	cpe:/a:google:chrome:30.0.1599.30	26	11	0
11155	11677	adobe	flash_player	Adobe Flash Player 8.0.34.0	Neolane v6.0 6.1.0 Build 8113	8.0.34.0	-	35	39	100	48	45	0	0	cpe:/a:adobe:flash_player:8.0.34.0	27	29	0



In [3]:

    
# Check that all rows are labelled

# (Should return "False")

df_label_software['match'].isnull().any()









    Out[3]:





False
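
Beyond missing labels, it can be useful to confirm that the label column holds only the expected values and to see how balanced the two classes are, since the balance influences what the `class_weight` option does during tuning. The following is a minimal sketch on a small made-up frame. The data is illustrative and not taken from `label_software.csv`, and it assumes `match` is a 0/1 flag, as the sample rows suggest.

```python
import pandas as pd

# Hypothetical stand-in for df_label_software; only the 'match' column matters here.
df_demo = pd.DataFrame({"match": [0, 0, 1, 0, 1, 0, 0, 0]})

# True only if no label is missing and every label is 0 or 1.
labels_ok = bool(
    df_demo["match"].notnull().all() and df_demo["match"].isin([0, 1]).all()
)

# Fraction of rows per class (the index holds the label value).
class_share = df_demo["match"].value_counts(normalize=True)

print(labels_ok)
print(class_share)
```

The same two checks could be run against `df_label_software['match']` before building the training arrays.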



In [4]:

    
# Format training data as "X" == features, "y" == target.
# The target value is the 1st column.
df_match_train1 = df_label_software[['match', 'fz_ratio', 'fz_ptl_ratio', 'fz_tok_set_ratio',
        'fz_ptl_tok_sort_ratio', 'fz_uwratio', 'fz_rel_ratio',
        'fz_rel_ptl_ratio', 'titlX_len', 'DsplyNm0_len']]

# Convert into 2 numpy arrays for the scikit-learn ML classification algorithms.
np_match_train1 = np.asarray(df_match_train1)
X, y = np_match_train1[:, 1:], np_match_train1[:, 0]

print(X.shape, y.shape)









    



((22262, 9), (22262,))

Use a grid search to tune the ML algorithm

As before, the classification algorithm needs to be tuned for optimal performance with the data.

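This is done using a randomized search (scikit-learn's RandomizedSearchCV), which samples a fixed number of parameter settings from the specified distributions instead of trying every combination in a grid. This code was modified from the scikit-learn sample code.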



In [5]:

    
# Now find optimum parameters for the model using a randomized search

from time import time
from scipy.stats import randint as sp_randint

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# build a classifier
clf = RandomForestClassifier()

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
            

# specify parameters and distributions to sample from
# (class_weight='balanced' replaces the 'auto' option of older
# scikit-learn releases, which current versions no longer accept)
param_dist = {"n_estimators": sp_randint(20, 100),
              "max_depth": [3, None],
              "max_features": sp_randint(1, 7),
              "min_samples_split": sp_randint(2, 7),
              "min_samples_leaf": sp_randint(1, 7),
              "bootstrap": [True, False],
              "class_weight": ['balanced', None],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 40
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)









    



RandomizedSearchCV took 73.01 seconds for 40 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.976 (std: 0.007)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'n_estimators': 50, 'min_samples_split': 5, 'criterion': 'gini', 'max_features': 4, 'max_depth': 3, 'class_weight': None}

Model with rank: 2
Mean validation score: 0.976 (std: 0.009)
Parameters: {'bootstrap': False, 'min_samples_leaf': 3, 'n_estimators': 56, 'min_samples_split': 4, 'criterion': 'entropy', 'max_features': 2, 'max_depth': None, 'class_weight': None}

Model with rank: 3
Mean validation score: 0.976 (std: 0.008)
Parameters: {'bootstrap': True, 'min_samples_leaf': 5, 'n_estimators': 92, 'min_samples_split': 6, 'criterion': 'gini', 'max_features': 2, 'max_depth': None, 'class_weight': None}
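
The fitted search object also exposes the top-ranked candidate directly, which avoids copying parameters by hand. The sketch below uses synthetic data from `make_classification` as a stand-in for the real training arrays (an assumption for illustration, so it does not reproduce the scores above) and a fixed `random_state` so that the sampled candidates are repeatable.

```python
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the (X, y) training arrays: 9 features, binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

# A reduced, illustrative search space.
param_dist_demo = {
    "n_estimators": sp_randint(20, 40),
    "max_depth": [3, None],
    "min_samples_leaf": sp_randint(1, 7),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist_demo,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# Parameters and mean cross-validated score of the rank-1 candidate.
print(search.best_params_)
print(round(search.best_score_, 3))

# With the default refit=True, best_estimator_ has already been
# retrained on all of X_demo using the best parameters.
best_clf = search.best_estimator_
```

When several candidates score almost identically, as in the report above, choosing by hand among them remains a reasonable alternative to taking `best_params_` as-is.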

Run the ML classifier with optimum parameters on the training data

Based on the above, and ignoring default values, the optimum set of parameters would be something like the following:

'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 55, 'min_samples_split': 5, 'criterion': 'gini', 'max_features': 4, 'max_depth': 3, 'class_weight': None

The RandomForest classifier is now trained on the training data to produce the model.



In [9]:

    
clf = RandomForestClassifier(
    bootstrap=True,
    min_samples_leaf=3,
    n_estimators=55,
    min_samples_split=5,
    criterion='gini',
    max_features=4,
    max_depth=3,
    class_weight=None
)

# Train model on original training data
clf.fit(X, y)

# save model for future use

# Recent scikit-learn releases no longer bundle joblib as
# sklearn.externals.joblib; import the standalone package instead.
import joblib
joblib.dump(clf, '/home/jovyan/work/shared/data/models/software_classif_trained_Rdm_Forest.pkl.z')









    Out[9]:





['/home/jovyan/work/shared/data/models/software_classif_trained_Rdm_Forest.pkl.z']



In [10]:

    
# Test loading

clf = joblib.load('/home/jovyan/work/shared/data/models/software_classif_trained_Rdm_Forest.pkl.z')
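
One simple way to confirm the round trip is to compare predictions from the original and the reloaded classifier. The sketch below does this with a small synthetic model and a temporary file rather than the real model path. It assumes the standalone `joblib` package is installed, and the names `clf_demo` and `demo_model.pkl.z` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the trained software classifier.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save and reload through a throwaway file.
    model_path = os.path.join(tmp_dir, "demo_model.pkl.z")
    joblib.dump(clf_demo, model_path)
    clf_loaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    clf_demo.predict(X_demo), clf_loaded.predict(X_demo)
)
print(same_predictions)
```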
