Train vendor matching algorithm

The ML classification algorithm has to be retrained from time to time, e.g. when scikit-learn undergoes a major release upgrade.

The specific algorithm used is a RandomForest classifier. In initial testing using a k-fold cross-validation approach, this algorithm outperformed several other simple classification algorithms; a sketch of such a comparison is shown below.
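
A minimal sketch of how such a comparison might be run with k-fold cross-validation is shown here. The list of candidate classifiers and the cv=5 setting are illustrative assumptions, not the original benchmark, and the sketch assumes the feature matrix X and labels y that are built later in this notebook.

# Hypothetical sketch: compare a few simple classifiers with k-fold cross-validation.
# The candidate list and cv=5 are illustrative; X and y are built further down.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

candidates = {'random_forest': RandomForestClassifier(),
              'logistic_regression': LogisticRegression(),
              'decision_tree': DecisionTreeClassifier(),
              'naive_bayes': GaussianNB()}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
    print('{0}: mean={1:.3f} std={2:.3f}'.format(name, scores.mean(), scores.std()))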

The training proceeds as follows:

  • First, the algorithm is tuned for typical data using a randomized parameter search.
  • Next, the ML classifier is trained on the training data using the optimum parameters.
  • Finally, the trained model is stored for future use.

In [1]:
# Initialize

import sys  # used for sys.exc_info() in the error handling below
import pandas as pd
import numpy as np
import pip  # needed to list the installed distributions

# Show versions of all installed software to help debug incompatibilities.

for i in pip.get_installed_distributions(local_only=True):
    print(i)


zict 0.1.2
xmltodict 0.11.0
xlrd 1.0.0
widgetsnbextension 2.0.0
wheel 0.29.0
webencodings 0.5
wcwidth 0.1.7
vincent 0.4.4
urllib3 1.21.1
traitlets 4.3.2
tornado 4.5.1
toolz 0.8.2
testpath 0.3
terminado 0.6
tblib 1.3.2
sympy 1.0
subprocess32 3.2.7
statsmodels 0.8.0
SQLAlchemy 1.1.11
sortedcontainers 1.5.3
six 1.10.0
singledispatch 3.4.0.3
simplegeneric 0.8.1
setuptools 36.2.0
seaborn 0.7.1
scipy 0.19.1
scikit-learn 0.18.2
scikit-image 0.12.3
schedule 0.4.3
scandir 1.5
responses 0.5.1
requests 2.18.1
pyzmq 16.0.2
PyYAML 3.12
pytz 2017.2
python-Levenshtein 0.12.0
python-dateutil 2.6.0
pytest 3.1.3
PySocks 1.6.7
pyparsing 2.2.0
pyOpenSSL 16.2.0
Pygments 2.2.0
pycparser 2.18
py 1.4.34
ptyprocess 0.5.2
psutil 5.2.1
prompt-toolkit 1.0.14
pip 9.0.1
Pillow 4.2.1
pickleshare 0.7.3
pexpect 4.2.1
pbr 3.1.1
patsy 0.4.1
pathlib2 2.3.0
partd 0.3.8
pandocfilters 1.4.1
pandas 0.19.2
olefile 0.44
numpy 1.12.1
numexpr 2.6.2
numba 0.31.0+0.g3bb1d98.dirty
notebook 5.0.0
networkx 1.11
nbformat 4.3.0
nbconvert 5.2.1
msgpack-python 0.4.8
mpmath 0.19
mock 2.0.0
mistune 0.7.4
matplotlib 2.0.2
MarkupSafe 1.0
locket 0.2.0
llvmlite 0.16.0
jupyter-core 4.3.0
jupyter-client 5.1.0
jsonschema 2.5.1
Jinja2 2.9.5
ipywidgets 6.0.0
ipython 5.3.0
ipython-genutils 0.2.0
ipykernel 4.6.1
ipaddress 1.0.18
idna 2.5
html5lib 0.9999999
heapdict 1.0.0
h5py 2.6.0
fuzzywuzzy 0.15.0
futures 3.0.5
functools32 3.2.3.post2
funcsigs 1.0.2
fastcache 1.0.2
enum34 1.1.6
entrypoints 0.2.3
distributed 1.18.0
dill 0.2.6
decorator 4.0.11
dask 0.15.1
Cython 0.25.2
cryptography 1.9
cookies 2.2.1
configparser 3.5.0
cloudpickle 0.2.2
click 6.7
chardet 3.0.4
cffi 1.10.0
certifi 2017.4.17
bokeh 0.12.6
bleach 1.5.0
bkcharts 0.2
beautifulsoup4 4.5.3
backports.ssl-match-hostname 3.5.0.1
backports.shutil-get-terminal-size 1.0.0
backports-abc 0.5
asn1crypto 0.22.0
cycler 0.10.0
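
Note that pip.get_installed_distributions was removed in pip 10, so the cell above only works against older pip releases such as the pip 9.0.1 listed here. A minimal sketch of an equivalent listing for newer environments, using pkg_resources (part of setuptools):

# Alternative for pip >= 10, where pip.get_installed_distributions is no longer available.
import pkg_resources

for dist in sorted(pkg_resources.working_set, key=lambda d: d.project_name.lower()):
    print('{0} {1}'.format(dist.project_name, dist.version))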

Read in the vendor training data

Read in the manually labelled vendor training data.

Format it and convert to two numpy arrays for input to the scikit-learn ML algorithm.


In [2]:
try:
    df_label_vendors = pd.io.parsers.read_csv(
                            "/home/jovyan/work/shared/data/csv/label_vendors.csv",
                            error_bad_lines=False,
                            warn_bad_lines=True,
                            quotechar='"',
                            encoding='utf-8')
except IOError as e:
    print('\n\n***I/O error({0}): {1}\n\n'.format(
                e.errno, e.strerror))

# except ValueError:
#    self.logger.critical('Could not convert data to an integer.')
except:
    print(
        '\n\n***Unexpected error: {0}\n\n'.format(
            sys.exc_info()[0]))
    raise

# Number of records / columns

df_label_vendors.shape


Out[2]:
(10110, 13)

In [3]:
# Print out some sample values

df_label_vendors.sample(5)


Out[3]:
Unnamed: 0 fz_ptl_ratio fz_ptl_tok_sort_ratio fz_ratio fz_tok_set_ratio fz_uwratio pub0_cln publisher0 ven_cln vendor_X match ven_len pu0_len
2245 2245 83 80 69 82 85 convert audio free convert audio free convert in convert-in 0 10 18
4956 4956 100 100 100 100 100 lavasoft lavasoft lavasoft lavasoft 1 8 8
2029 2029 80 100 48 100 90 blue cat audio blue cat audio cat cat 0 3 14
2715 2715 61 54 62 55 55 hanbit soft hanbit soft driver soft driver-soft 0 11 11
7511 7511 71 100 27 21 60 copyright 2013 esupport all rights reserved copyright © 2013 esupport.com • all rights res... serve serve 0 5 54
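
The fz_* columns look like fuzzy string-similarity scores between the cleaned vendor (ven_cln) and publisher (pub0_cln) strings, and ven_len / pu0_len their lengths. The sketch below shows how such features could be computed with the installed fuzzywuzzy package; the mapping of columns to scorers is an assumption based on the column names, not taken from the labelling code.

# Hypothetical sketch of the feature computation; the column-to-scorer mapping is
# inferred from the column names only.
from fuzzywuzzy import fuzz

def vendor_features(ven_cln, pub0_cln):
    return {'fz_ptl_ratio': fuzz.partial_ratio(ven_cln, pub0_cln),
            'fz_ptl_tok_sort_ratio': fuzz.partial_token_sort_ratio(ven_cln, pub0_cln),
            'fz_ratio': fuzz.ratio(ven_cln, pub0_cln),
            'fz_tok_set_ratio': fuzz.token_set_ratio(ven_cln, pub0_cln),
            'fz_uwratio': fuzz.UWRatio(ven_cln, pub0_cln),
            'ven_len': len(ven_cln),
            'pu0_len': len(pub0_cln)}

vendor_features('lavasoft', 'lavasoft')   # identical strings score 100 on every ratio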

In [4]:
# Check that all rows are labelled

# (Should return "False")

df_label_vendors['match'].isnull().any()


Out[4]:
False

In [5]:
# Format the training data as "X" == features, "y" == target.
# The target value ("match") is the first column.
df_match_train1 = df_label_vendors[['match','fz_ptl_ratio', 'fz_ptl_tok_sort_ratio', 'fz_ratio', 'fz_tok_set_ratio', 'fz_uwratio','ven_len', 'pu0_len']]

# Convert into 2 numpy arrays for the scikit-learn ML classification algorithms.
np_match_train1 = np.asarray(df_match_train1)
X, y = np_match_train1[:, 1:], np_match_train1[:, 0]

print(X.shape, y.shape)


((10110, 7), (10110,))

Use a grid search to tune the ML algorithm

Once the best algorithm has been determined, it should be tuned for optimal performance with the data.

This is done using a grid search. From the scikit-learn documentation:

Parameters that are not directly learnt within estimators can be set by searching a parameter space for the best cross-validation score... Any parameter provided when constructing an estimator may be optimized in this manner.

Rather than performing a compute-intensive exhaustive search of the entire parameter space, a randomized search is used to find a reasonably good set of parameters; a sketch of the exhaustive alternative follows for comparison.
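
For comparison, an exhaustive search over a comparable, discretized grid would look something like the sketch below; the specific value lists are illustrative only. Even this modest grid yields 720 candidate settings, each fitted once per cross-validation fold, which is why the randomized search in the next cell is preferred here.

# Illustrative sketch only: an exhaustive grid over discretized values of comparable
# parameters.  GridSearchCV fits every combination once per cross-validation fold.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {"n_estimators": [20, 40, 60, 80, 100],
              "max_depth": [3, None],
              "max_features": [1, 3, 5, 7],
              "min_samples_split": [2, 4, 6],
              "min_samples_leaf": [1, 3, 5],
              "criterion": ["gini", "entropy"]}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid=param_grid)
# grid_search.fit(X, y)   # 5 * 2 * 4 * 3 * 3 * 2 = 720 candidate settings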

The code in the cell below was adapted from the scikit-learn sample code.


In [6]:
# Now find optimum parameters for the model using a randomized parameter search

from time import time
from scipy.stats import randint as sp_randint

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# build a classifier
clf = RandomForestClassifier()

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
            

# specify parameters and distributions to sample from
param_dist = {"n_estimators": sp_randint(20, 100),
              "max_depth": [3, None],
              "max_features": sp_randint(1, 7),
              "min_samples_split": sp_randint(2, 7),
              "min_samples_leaf": sp_randint(1, 7),
              "bootstrap": [True, False],
              # 'auto' is deprecated; newer scikit-learn releases expect 'balanced' instead
              "class_weight": ['auto', None],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 40
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)


RandomizedSearchCV took 54.38 seconds for 40 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.980 (std: 0.005)
Parameters: {'bootstrap': True, 'min_samples_leaf': 4, 'n_estimators': 30, 'min_samples_split': 2, 'criterion': 'entropy', 'max_features': 3, 'max_depth': 3, 'class_weight': None}

Model with rank: 2
Mean validation score: 0.980 (std: 0.005)
Parameters: {'bootstrap': False, 'min_samples_leaf': 2, 'n_estimators': 39, 'min_samples_split': 6, 'criterion': 'gini', 'max_features': 4, 'max_depth': 3, 'class_weight': None}

Model with rank: 2
Mean validation score: 0.980 (std: 0.005)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'n_estimators': 84, 'min_samples_split': 6, 'criterion': 'entropy', 'max_features': 3, 'max_depth': 3, 'class_weight': None}

Run the ML classifier with optimum parameters on the training data

Based on the above, and ignoring parameters left at their default values, the optimum set of parameters is something like the following:

'bootstrap': True, 'min_samples_leaf': 2, 'n_estimators': 40, 'min_samples_split': 4, 'criterion': 'entropy', 'max_features': 3, 'max_depth': 3, 'class_weight': None

The RandomForest classifier is now trained on the full training data to produce the model.


In [10]:
clf = RandomForestClassifier(
    bootstrap=True,
    min_samples_leaf=2,
    n_estimators=40,
    min_samples_split=4,
    criterion='entropy',
    max_features=3,
    max_depth=3,
    class_weight=None
)

# Train model on original training data
clf.fit(X, y)

# save model for future use

from sklearn.externals import joblib
joblib.dump(clf, '/home/jovyan/work/shared/data/models/vendor_classif_trained_Rdm_Forest.pkl.z')


Out[10]:
['/home/jovyan/work/shared/data/models/vendor_classif_trained_Rdm_Forest.pkl.z']

In [11]:
# Test loading

clf = joblib.load('/home/jovyan/work/shared/data/models/vendor_classif_trained_Rdm_Forest.pkl.z')
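
To confirm that the reloaded model behaves as expected, it can be applied to feature rows in the same seven-column layout used for training. A minimal usage sketch, re-using the first few rows of the training matrix X purely for illustration:

# Sketch: apply the reloaded model to rows with the same feature layout as X.
sample = X[:5]                       # first five labelled rows, for illustration only
print(clf.predict(sample))           # predicted match / no-match labels (0 or 1)
print(clf.predict_proba(sample))     # per-class probabilities for each row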