Choose the best classification algorithm

Use k-fold cross-validation to choose the best classification algorithm.

From the scikit-learn documentation concerning k-fold cross-validation:

To avoid it ["overfitting"], it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test.

In the basic approach, called k-fold CV, the training set is split into k smaller sets... The following procedure is followed for each of the k “folds”:

  • A model is trained using k-1 of the folds as training data;
  • the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The following code uses this technique to evaluate the relative performance of various ML classification algorithms on the training data.

Random Forest turns out to be one of the best choices; it and kNN score highest in the comparison below.
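
For reference, cross_val_score (used in the comparison below) wraps the per-fold train/validate loop described above. A minimal hand-rolled sketch of that loop, assuming X (features) and y (targets) are the numpy arrays built later in this notebook and using Random Forest purely as an example:

# Minimal sketch of the per-fold loop that cross_val_score wraps.
# Assumes X (features) and y (targets) are the numpy arrays built below.
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
import numpy as np

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X[train_idx], y[train_idx])                       # train on k-1 folds
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))   # validate on the held-out fold
print("mean accuracy: %0.2f" % np.mean(fold_scores))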


In [10]:
# Initialize

import pandas as pd
import numpy as np
import pip  # needed for pip.get_installed_distributions() below

# Show versions of all installed software to help debug incompatibilities.

for i in pip.get_installed_distributions(local_only=True):
    print(i)


zict 0.1.2
xmltodict 0.11.0
xlrd 1.0.0
widgetsnbextension 2.0.0
wheel 0.29.0
webencodings 0.5
wcwidth 0.1.7
vincent 0.4.4
urllib3 1.21.1
traitlets 4.3.2
tornado 4.5.1
toolz 0.8.2
testpath 0.3
terminado 0.6
tblib 1.3.2
sympy 1.0
subprocess32 3.2.7
statsmodels 0.8.0
SQLAlchemy 1.1.11
sortedcontainers 1.5.3
six 1.10.0
singledispatch 3.4.0.3
simplegeneric 0.8.1
setuptools 36.2.0
seaborn 0.7.1
scipy 0.19.1
scikit-learn 0.18.2
scikit-image 0.12.3
schedule 0.4.3
scandir 1.5
responses 0.5.1
requests 2.18.1
pyzmq 16.0.2
PyYAML 3.12
pytz 2017.2
python-Levenshtein 0.12.0
python-dateutil 2.6.0
pytest 3.1.3
PySocks 1.6.7
pyparsing 2.2.0
pyOpenSSL 16.2.0
Pygments 2.2.0
pycparser 2.18
py 1.4.34
ptyprocess 0.5.2
psutil 5.2.1
prompt-toolkit 1.0.14
pip 9.0.1
Pillow 4.2.1
pickleshare 0.7.3
pexpect 4.2.1
pbr 3.1.1
patsy 0.4.1
pathlib2 2.3.0
partd 0.3.8
pandocfilters 1.4.1
pandas 0.19.2
olefile 0.44
numpy 1.12.1
numexpr 2.6.2
numba 0.31.0+0.g3bb1d98.dirty
notebook 5.0.0
networkx 1.11
nbformat 4.3.0
nbconvert 5.2.1
msgpack-python 0.4.8
mpmath 0.19
mock 2.0.0
mistune 0.7.4
matplotlib 2.0.2
MarkupSafe 1.0
locket 0.2.0
llvmlite 0.16.0
jupyter-core 4.3.0
jupyter-client 5.1.0
jsonschema 2.5.1
Jinja2 2.9.5
ipywidgets 6.0.0
ipython 5.3.0
ipython-genutils 0.2.0
ipykernel 4.6.1
ipaddress 1.0.18
idna 2.5
html5lib 0.9999999
heapdict 1.0.0
h5py 2.6.0
fuzzywuzzy 0.15.0
futures 3.0.5
functools32 3.2.3.post2
funcsigs 1.0.2
fastcache 1.0.2
enum34 1.1.6
entrypoints 0.2.3
distributed 1.18.0
dill 0.2.6
decorator 4.0.11
dask 0.15.1
Cython 0.25.2
cryptography 1.9
cookies 2.2.1
configparser 3.5.0
cloudpickle 0.2.2
click 6.7
chardet 3.0.4
cffi 1.10.0
certifi 2017.4.17
bokeh 0.12.6
bleach 1.5.0
bkcharts 0.2
beautifulsoup4 4.5.3
backports.ssl-match-hostname 3.5.0.1
backports.shutil-get-terminal-size 1.0.0
backports-abc 0.5
asn1crypto 0.22.0
cycler 0.10.0
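
Note that pip.get_installed_distributions() works with the pip 9.0.1 shown above but was removed in pip 10. On newer environments, a roughly equivalent listing (an aside, not part of the original notebook) can be produced with pkg_resources:

# Alternative package listing for environments where
# pip.get_installed_distributions() is no longer available (pip >= 10).
import pkg_resources

for dist in sorted(pkg_resources.working_set, key=lambda d: d.project_name.lower()):
    print(dist.project_name, dist.version)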

Read in the vendor training data

Read in the manually labelled vendor training data.

Format it and convert it to two numpy arrays for input to the scikit-learn classification algorithms.


In [11]:
import sys  # used by the generic exception handler below

try:
    df_label_vendors = pd.io.parsers.read_csv(
                            "/home/jovyan/work/shared/data/csv/label_vendors.csv",
                            error_bad_lines=False,
                            warn_bad_lines=True,
                            quotechar='"',
                            encoding='utf-8')
except IOError as e:
    print('\n\n***I/O error({0}): {1}\n\n'.format(
                e.errno, e.strerror))

# except ValueError:
#    self.logger.critical('Could not convert data to an integer.')
except:
    print(
        '\n\n***Unexpected error: {0}\n\n'.format(
            sys.exc_info()[0]))
    raise

# Number of records / columns

df_label_vendors.shape


Out[11]:
(10110, 13)
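
A side note for newer pandas versions (an aside, not part of the pandas 0.19.2 environment used here): error_bad_lines / warn_bad_lines were replaced by the single on_bad_lines parameter. A roughly equivalent call would be:

# Roughly equivalent read on pandas >= 1.3, where error_bad_lines /
# warn_bad_lines were replaced by the on_bad_lines parameter.
import pandas as pd

df_label_vendors = pd.read_csv(
    "/home/jovyan/work/shared/data/csv/label_vendors.csv",
    on_bad_lines="warn",
    quotechar='"',
    encoding="utf-8")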

In [12]:
# Format training data as "X" == "features", "y" == "target".
# The target value ("match") is the first column.
df_match_train1 = df_label_vendors[['match', 'fz_ptl_ratio', 'fz_ptl_tok_sort_ratio', 'fz_ratio',
                                    'fz_tok_set_ratio', 'fz_uwratio', 'ven_len', 'pu0_len']]

# Convert into 2 numpy arrays for the scikit-learn ML classification algorithms.
np_match_train1 = np.asarray(df_match_train1)
X, y = np_match_train1[:, 1:], np_match_train1[:, 0]

print(X.shape, y.shape)


((10110, 7), (10110,))
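
The same two arrays can also be built directly from the DataFrame without the intermediate combined numpy array; a minimal equivalent sketch (an alternative, not the notebook's original approach):

# Equivalent construction directly from the DataFrame.
y = df_match_train1['match'].values
X = df_match_train1.drop('match', axis=1).values
print(X.shape, y.shape)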

In [13]:
# Set up k-fold cross-validation to choose the best model.

# from sklearn import cross_validation  # deprecated; replaced by model_selection
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB



for clf, clf_name in (
        (RidgeClassifier(alpha=1.0), "Ridge Classifier"),
        (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier #2"),
        # n_iter was renamed max_iter in later scikit-learn releases.
        (Perceptron(n_iter=50), "Perceptron"),
        (PassiveAggressiveClassifier(n_iter=50), "Passive-Aggressive"),
        (KNeighborsClassifier(n_neighbors=10), "kNN"),
        (NearestCentroid(), "Nearest Centroid"),
        # class_weight="auto" is deprecated; newer scikit-learn uses "balanced".
        (RandomForestClassifier(n_estimators=100, class_weight="auto"), "Random forest"),
        (SGDClassifier(alpha=.0001, n_iter=50, penalty="l2"), "SGD / SVM"),
        (MultinomialNB(alpha=.01), "Naive Bayes")):

    scores = cross_val_score(clf, X, y, cv=5)
    print("%s, Accuracy: %0.2f (+/- %0.2f)" % (clf_name, scores.mean(), scores.std() * 2))


Ridge Classifier, Accuracy: 0.97 (+/- 0.02)
Ridge Classifier #2, Accuracy: 0.97 (+/- 0.02)
Perceptron, Accuracy: 0.93 (+/- 0.05)
Passive-Aggressive, Accuracy: 0.93 (+/- 0.03)
kNN, Accuracy: 0.98 (+/- 0.01)
Nearest Centroid, Accuracy: 0.90 (+/- 0.04)
Random forest, Accuracy: 0.98 (+/- 0.01)
SGD / SVM, Accuracy: 0.89 (+/- 0.21)
Naive Bayes, Accuracy: 0.78 (+/- 0.15)
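
Having chosen Random Forest, a minimal sketch of the natural next step (hypothetical, not shown in this notebook) would be to fit the selected classifier on the full training set; "balanced" is used here in place of the deprecated "auto" weighting:

# Hypothetical next step: fit the selected classifier on all training data
# so it can be applied to unlabelled vendor records later.
from sklearn.ensemble import RandomForestClassifier

clf_final = RandomForestClassifier(n_estimators=100, class_weight="balanced")
clf_final.fit(X, y)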