Train vendor matching algorithm

The ML classification algorithm has to be retrained from time to time, e.g. when scikit-learn undergoes a major release upgrade.

The specific algorithm used is a RandomForest classifier. In initial testing using a k-fold cross-validation approach, this algorithm outperformed several other simple classification algorithms; a sketch of such a comparison is shown below.
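
A minimal sketch of how such a comparison might be run with k-fold cross-validation is shown here. The list of candidate classifiers and the cv=5 setting are illustrative assumptions, not the original benchmark, and the sketch assumes the feature matrix X and labels y that are built later in this notebook.

# Hypothetical sketch: compare a few simple classifiers with k-fold cross-validation.
# The candidate list and cv=5 are illustrative; X and y are built further down.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

candidates = {'random_forest': RandomForestClassifier(),
              'logistic_regression': LogisticRegression(),
              'decision_tree': DecisionTreeClassifier(),
              'naive_bayes': GaussianNB()}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
    print('{0}: mean={1:.3f} std={2:.3f}'.format(name, scores.mean(), scores.std()))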

The training proceeds as follows:

  • First, the algorithm is tuned for typical data using a randomized parameter search.
  • Next, the ML classifier is trained on the training data using the optimum parameters.
  • Finally, the trained model is stored for future use.

In [1]:
# Initialize

import sys  # used for sys.exc_info() in the error handling below
import pandas as pd
import numpy as np
import pip  # needed to list the installed distributions

# Show versions of all installed software to help debug incompatibilities.

for i in pip.get_installed_distributions(local_only=True):
    print(i)


zict 0.1.2
xmltodict 0.11.0
xlrd 1.0.0
widgetsnbextension 2.0.0
wheel 0.29.0
webencodings 0.5
wcwidth 0.1.7
vincent 0.4.4
urllib3 1.21.1
traitlets 4.3.2
tornado 4.5.1
toolz 0.8.2
testpath 0.3
terminado 0.6
tblib 1.3.2
sympy 1.0
subprocess32 3.2.7
statsmodels 0.8.0
SQLAlchemy 1.1.11
sortedcontainers 1.5.3
six 1.10.0
singledispatch 3.4.0.3
simplegeneric 0.8.1
setuptools 36.2.0
seaborn 0.7.1
scipy 0.19.1
scikit-learn 0.18.2
scikit-image 0.12.3
schedule 0.4.3
scandir 1.5
responses 0.5.1
requests 2.18.1
pyzmq 16.0.2
PyYAML 3.12
pytz 2017.2
python-Levenshtein 0.12.0
python-dateutil 2.6.0
pytest 3.1.3
PySocks 1.6.7
pyparsing 2.2.0
pyOpenSSL 16.2.0
Pygments 2.2.0
pycparser 2.18
py 1.4.34
ptyprocess 0.5.2
psutil 5.2.1
prompt-toolkit 1.0.14
pip 9.0.1
Pillow 4.2.1
pickleshare 0.7.3
pexpect 4.2.1
pbr 3.1.1
patsy 0.4.1
pathlib2 2.3.0
partd 0.3.8
pandocfilters 1.4.1
pandas 0.19.2
olefile 0.44
numpy 1.12.1
numexpr 2.6.2
numba 0.31.0+0.g3bb1d98.dirty
notebook 5.0.0
networkx 1.11
nbformat 4.3.0
nbconvert 5.2.1
msgpack-python 0.4.8
mpmath 0.19
mock 2.0.0
mistune 0.7.4
matplotlib 2.0.2
MarkupSafe 1.0
locket 0.2.0
llvmlite 0.16.0
jupyter-core 4.3.0
jupyter-client 5.1.0
jsonschema 2.5.1
Jinja2 2.9.5
ipywidgets 6.0.0
ipython 5.3.0
ipython-genutils 0.2.0
ipykernel 4.6.1
ipaddress 1.0.18
idna 2.5
html5lib 0.9999999
heapdict 1.0.0
h5py 2.6.0
fuzzywuzzy 0.15.0
futures 3.0.5
functools32 3.2.3.post2
funcsigs 1.0.2
fastcache 1.0.2
enum34 1.1.6
entrypoints 0.2.3
distributed 1.18.0
dill 0.2.6
decorator 4.0.11
dask 0.15.1
Cython 0.25.2
cryptography 1.9
cookies 2.2.1
configparser 3.5.0
cloudpickle 0.2.2
click 6.7
chardet 3.0.4
cffi 1.10.0
certifi 2017.4.17
bokeh 0.12.6
bleach 1.5.0
bkcharts 0.2
beautifulsoup4 4.5.3
backports.ssl-match-hostname 3.5.0.1
backports.shutil-get-terminal-size 1.0.0
backports-abc 0.5
asn1crypto 0.22.0
cycler 0.10.0
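
Note that pip.get_installed_distributions was removed in pip 10, so the cell above only works against older pip releases such as the pip 9.0.1 listed here. A minimal sketch of an equivalent listing for newer environments, using pkg_resources (part of setuptools):

# Alternative for pip >= 10, where pip.get_installed_distributions is no longer available.
import pkg_resources

for dist in sorted(pkg_resources.working_set, key=lambda d: d.project_name.lower()):
    print('{0} {1}'.format(dist.project_name, dist.version))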

Read in the vendor training data

Read in the manually labelled vendor training data.

Format it and convert to two numpy arrays for input to the scikit-learn ML algorithm.


In [2]:
try:
    df_label_vendors = pd.io.parsers.read_csv(
                            "/home/jovyan/work/shared/data/csv/label_vendors.csv",
                            error_bad_lines=False,
                            warn_bad_lines=True,
                            quotechar='"',
                            encoding='utf-8')
except IOError as e:
    print('\n\n***I/O error({0}): {1}\n\n'.format(
                e.errno, e.strerror))

# except ValueError:
#    self.logger.critical('Could not convert data to an integer.')
except:
    print(
        '\n\n***Unexpected error: {0}\n\n'.format(
            sys.exc_info()[0]))
    raise

# Number of records / columns

df_label_vendors.shape


Out[2]:
(10110, 13)

In [3]:
# Print out some sample values

df_label_vendors.sample(5)


Out[3]:
Unnamed: 0 fz_ptl_ratio fz_ptl_tok_sort_ratio fz_ratio fz_tok_set_ratio fz_uwratio pub0_cln publisher0 ven_cln vendor_X match ven_len pu0_len
2245 2245 83 80 69 82 85 convert audio free convert audio free convert in convert-in 0 10 18
4956 4956 100 100 100 100 100 lavasoft lavasoft lavasoft lavasoft 1 8 8
2029 2029 80 100 48 100 90 blue cat audio blue cat audio cat cat 0 3 14
2715 2715 61 54 62 55 55 hanbit soft hanbit soft driver soft driver-soft 0 11 11
7511 7511 71 100 27 21 60 copyright 2013 esupport all rights reserved copyright © 2013 esupport.com • all rights res... serve serve 0 5 54
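
The fz_* columns look like fuzzy string-similarity scores between the cleaned vendor (ven_cln) and publisher (pub0_cln) strings, and ven_len / pu0_len their lengths. The sketch below shows how such features could be computed with the installed fuzzywuzzy package; the mapping of columns to scorers is an assumption based on the column names, not taken from the labelling code.

# Hypothetical sketch of the feature computation; the column-to-scorer mapping is
# inferred from the column names only.
from fuzzywuzzy import fuzz

def vendor_features(ven_cln, pub0_cln):
    return {'fz_ptl_ratio': fuzz.partial_ratio(ven_cln, pub0_cln),
            'fz_ptl_tok_sort_ratio': fuzz.partial_token_sort_ratio(ven_cln, pub0_cln),
            'fz_ratio': fuzz.ratio(ven_cln, pub0_cln),
            'fz_tok_set_ratio': fuzz.token_set_ratio(ven_cln, pub0_cln),
            'fz_uwratio': fuzz.UWRatio(ven_cln, pub0_cln),
            'ven_len': len(ven_cln),
            'pu0_len': len(pub0_cln)}

vendor_features('lavasoft', 'lavasoft')   # identical strings score 100 on every ratio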

In [4]:
# Check that all rows are labelled

# (Should return "False")

df_label_vendors['match'].isnull().any()


Out[4]:
False

In [5]:
# Format the training data as "X" == features, "y" == target.
# The target value ("match") is the first column.
df_match_train1 = df_label_vendors[['match','fz_ptl_ratio', 'fz_ptl_tok_sort_ratio', 'fz_ratio', 'fz_tok_set_ratio', 'fz_uwratio','ven_len', 'pu0_len']]

# Convert into 2 numpy arrays for the scikit-learn ML classification algorithms.
np_match_train1 = np.asarray(df_match_train1)
X, y = np_match_train1[:, 1:], np_match_train1[:, 0]

print(X.shape, y.shape)


((10110, 7), (10110,))

Use a grid search to tune the ML algorithm

Once the best algorithm has been determined, it should be tuned for optimal performance with the data.

This is done using a grid search. From the scikit-learn documentation:

Parameters that are not directly learnt within estimators can be set by searching a parameter space for the best cross-validation score... Any parameter provided when constructing an estimator may be optimized in this manner.

Rather than performing a compute-intensive exhaustive search of the entire parameter space, a randomized search is used to find a reasonably good set of parameters; a sketch of the exhaustive alternative follows for comparison.
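
For comparison, an exhaustive search over a comparable, discretized grid would look something like the sketch below; the specific value lists are illustrative only. Even this modest grid yields 720 candidate settings, each fitted once per cross-validation fold, which is why the randomized search in the next cell is preferred here.

# Illustrative sketch only: an exhaustive grid over discretized values of comparable
# parameters.  GridSearchCV fits every combination once per cross-validation fold.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {"n_estimators": [20, 40, 60, 80, 100],
              "max_depth": [3, None],
              "max_features": [1, 3, 5, 7],
              "min_samples_split": [2, 4, 6],
              "min_samples_leaf": [1, 3, 5],
              "criterion": ["gini", "entropy"]}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid=param_grid)
# grid_search.fit(X, y)   # 5 * 2 * 4 * 3 * 3 * 2 = 720 candidate settings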

The code in the cell below was adapted from the scikit-learn sample code.


In [6]:
# Now find optimum parameters for the model using a randomized parameter search

from time import time
from scipy.stats import randint as sp_randint

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# build a classifier
clf = RandomForestClassifier()

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
            

# specify parameters and distributions to sample from
param_dist = {"n_estimators": sp_randint(20, 100),
              "max_depth": [3, None],
              "max_features": sp_randint(1, 7),
              "min_samples_split": sp_randint(2, 7),
              "min_samples_leaf": sp_randint(1, 7),
              "bootstrap": [True, False],
              # 'auto' is deprecated; newer scikit-learn releases expect 'balanced' instead
              "class_weight": ['auto', None],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 40
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)


RandomizedSearchCV took 54.38 seconds for 40 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.980 (std: 0.005)
Parameters: {'bootstrap': True, 'min_samples_leaf': 4, 'n_estimators': 30, 'min_samples_split': 2, 'criterion': 'entropy', 'max_features': 3, 'max_depth': 3, 'class_weight': None}

Model with rank: 2
Mean validation score: 0.980 (std: 0.005)
Parameters: {'bootstrap': False, 'min_samples_leaf': 2, 'n_estimators': 39, 'min_samples_split': 6, 'criterion': 'gini', 'max_features': 4, 'max_depth': 3, 'class_weight': None}

Model with rank: 2
Mean validation score: 0.980 (std: 0.005)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'n_estimators': 84, 'min_samples_split': 6, 'criterion': 'entropy', 'max_features': 3, 'max_depth': 3, 'class_weight': None}

Run the ML classifier with optimum parameters on the training data

Based on the above, and ignoring parameters left at their default values, the optimum set of parameters is something like the following:

'bootstrap': True, 'min_samples_leaf': 2, 'n_estimators': 40, 'min_samples_split': 4, 'criterion': 'entropy', 'max_features': 3, 'max_depth': 3, 'class_weight': None

The RandomForest classifier is now trained on the full training data to produce the model.


In [10]:
clf = RandomForestClassifier(
    bootstrap=True,
    min_samples_leaf=2,
    n_estimators=40,
    min_samples_split=4,
    criterion='entropy',
    max_features=3,
    max_depth=3,
    class_weight=None
)

# Train model on original training data
clf.fit(X, y)

# save model for future use

from sklearn.externals import joblib
joblib.dump(clf, '/home/jovyan/work/shared/data/models/vendor_classif_trained_Rdm_Forest.pkl.z')


Out[10]:
['/home/jovyan/work/shared/data/models/vendor_classif_trained_Rdm_Forest.pkl.z']

In [11]:
# Test loading

clf = joblib.load('/home/jovyan/work/shared/data/models/vendor_classif_trained_Rdm_Forest.pkl.z')
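
To confirm that the reloaded model behaves as expected, it can be applied to feature rows in the same seven-column layout used for training. A minimal usage sketch, re-using the first few rows of the training matrix X purely for illustration:

# Sketch: apply the reloaded model to rows with the same feature layout as X.
sample = X[:5]                       # first five labelled rows, for illustration only
print(clf.predict(sample))           # predicted match / no-match labels (0 or 1)
print(clf.predict_proba(sample))     # per-class probabilities for each row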