Classification algorithms assign a discrete label to a vector X based on the vector's elements. Unlike regression, there is a wide array of classification models to choose from: probabilistic methods such as Naive Bayes and Logistic Regression, non-parametric methods such as kNN and Random Forests, and margin-based or discriminant methods such as SVMs and LDA.
As before, the basic methodology is to select the best model through trial and evaluation, using cross-validation and F1 scores to compare candidates. The cells below fit each candidate on a simple train/test split; a cross-validated comparison follows them.
In [ ]:
# Using the IRIS data set - the classic classification data set.
from sklearn.model_selection import train_test_split as tts
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
data = load_iris()
X_train, X_test, y_train, y_test = tts(data.data, data.target)
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X_train, y_train)
yhat = model.predict(X_test)
print(classification_report(y_test, yhat))
In [ ]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
yhat = model.predict(X_test)
print(classification_report(y_test, yhat))
In [ ]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
yhat = model.predict(X_test)
print(classification_report(y_test, yhat))
In [ ]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
yhat = model.predict(X_test)
print(classification_report(y_test, yhat))
In [ ]:
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
yhat = model.predict(X_test)
print(classification_report(y_test, yhat))
In [ ]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
model.fit(X_train, y_train)
yhat = model.predict(X_test)
print(classification_report(y_test, yhat))
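The reports above each come from a single train/test split. As a minimal sketch of the cross-validated, F1-based comparison described at the start of this section (the 5-fold split and macro-averaged F1 are illustrative choices, not requirements), the same candidates can be scored with cross_val_score:
In [ ]:
from sklearn.model_selection import cross_val_score

# Score each candidate with 5-fold cross-validation and macro-averaged F1.
# The fold count and averaging strategy are illustrative choices.
candidates = {
    "kNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(),
    "LDA": LinearDiscriminantAnalysis(),
}

for name, model in candidates.items():
    scores = cross_val_score(model, data.data, data.target, cv=5, scoring="f1_macro")
    print("{:20} F1 (macro): {:0.3f} +/- {:0.3f}".format(name, scores.mean(), scores.std()))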
The data was downloaded from the UCI Machine Learning Repository on February 26, 2015. The first step is to fully describe your data in a README file. The dataset description is as follows:
The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.
The data set can be used for the tasks of classification and cluster analysis.
To construct the data, seven geometric parameters of wheat kernels were measured:
1. area A
2. perimeter P
3. compactness C = 4*pi*A/P^2
4. length of kernel
5. width of kernel
6. asymmetry coefficient
7. length of kernel groove
All of these parameters are real-valued and continuous.
M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, 'A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images', in: Information Technologies in Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24.
In this section we will begin to explore the dataset to determine relevant information.
In [ ]:
%matplotlib inline
import os
import json
import time
import pickle
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [ ]:
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt"
def fetch_data(fname='data/wheat/seeds_dataset.txt'):
    """
    Helper method to retrieve the ML Repository dataset.
    """
    response = requests.get(URL)
    outpath = os.path.abspath(fname)
    with open(outpath, 'wb') as f:
        f.write(response.content)

    return outpath
# Fetch the data if it is not already on disk; otherwise point DATA at the local copy
DATA = os.path.abspath('data/wheat/seeds_dataset.txt')
# DATA = fetch_data()
In [ ]:
FEATURES = [
"area",
"perimeter",
"compactness",
"length",
"width",
"asymmetry",
"groove",
"label"
]
LABEL_MAP = {
1: "Kama",
2: "Rosa",
3: "Canadian",
}
# Read the data into a DataFrame
df = pd.read_csv(DATA, sep=r'\s+', header=None, names=FEATURES)
# Convert class labels into text
for k,v in LABEL_MAP.items():
df.loc[df.label == k, 'label'] = v
# Describe the dataset
print(df.describe())
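Since the dataset description defines compactness as C = 4*pi*A/P^2, a quick, purely illustrative sanity check on the parsed columns is to recompute it from area and perimeter and compare it with the stored values:
In [ ]:
# Recompute compactness from area and perimeter and compare it to the
# stored column; a sanity check on the parsed data, not part of the analysis.
computed = 4 * np.pi * df['area'] / df['perimeter'] ** 2
print("max absolute difference: {:0.4f}".format((computed - df['compactness']).abs().max()))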
In [ ]:
# Determine the shape of the data
print("{} instances with {} features\n".format(*df.shape))
# Determine the frequency of each class
print(df.groupby('label')['label'].count())
In [ ]:
# Create a scatter matrix of the dataframe features
from pandas.plotting import scatter_matrix
scatter_matrix(df, alpha=0.2, figsize=(12, 12), diagonal='kde')
plt.show()
In [ ]:
from pandas.plotting import parallel_coordinates
plt.figure(figsize=(12,12))
parallel_coordinates(df, 'label')
plt.show()
In [ ]:
from pandas.plotting import radviz
plt.figure(figsize=(12,12))
radviz(df, 'label')
plt.show()
One way that we can structure our data for easy management is to save files on disk. The Scikit-Learn datasets are already structured this way, and when loaded into a Bunch (a simple dictionary-like container class provided by Scikit-Learn) we can expose a data API that is very familiar to how we've trained on our toy datasets in the past. A Bunch object exposes some important properties:
* data: the array of instances, of shape n_samples * n_features
* target: the array of class labels, of length n_samples
* feature_names: the names of the feature attributes
* target_names: the names of the target classes
* filenames: the paths of the files that were loaded
* DESCR: the contents of the README describing the dataset
Note: this does not preclude database storage of the data; in fact, a database can easily be extended to load the same Bunch API. Simply store the README and features in a dataset description table and load them from there. The filenames property will be redundant, but you could store a SQL statement that shows the data load.
In order to manage our data set on disk, we'll structure our data as follows:
In [ ]:
from sklearn.utils import Bunch
DATA_DIR = os.path.abspath(os.path.join(".", "..", "data", "wheat"))
# Show the contents of the data directory
for name in os.listdir(DATA_DIR):
    if name.startswith("."): continue
    print("- {}".format(name))
In [ ]:
def load_data(root=DATA_DIR):
    # Construct the `Bunch` for the wheat dataset
    filenames = {
        'meta': os.path.join(root, 'meta.json'),
        'rdme': os.path.join(root, 'README.md'),
        'data': os.path.join(root, 'seeds_dataset.txt'),
    }

    # Load the meta data from the meta json
    with open(filenames['meta'], 'r') as f:
        meta = json.load(f)
        target_names = meta['target_names']
        feature_names = meta['feature_names']

    # Load the description from the README.
    with open(filenames['rdme'], 'r') as f:
        DESCR = f.read()

    # Load the dataset from the text file.
    dataset = np.loadtxt(filenames['data'])

    # Extract the target from the data
    data = dataset[:, 0:-1]
    target = dataset[:, -1]

    # Create the bunch object
    return Bunch(
        data=data,
        target=target,
        filenames=filenames,
        target_names=target_names,
        feature_names=feature_names,
        DESCR=DESCR,
    )
# Save the dataset as a variable we can use.
dataset = load_data()
print(dataset.data.shape)
print(dataset.target.shape)
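As a quick, purely illustrative check that the Bunch exposes the properties described above, we can print the names and the start of the README it carries:
In [ ]:
# Confirm the Bunch properties described above are populated.
print(dataset.target_names)
print(dataset.feature_names)
print(dataset.DESCR[:200])  # first few lines of the README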
In [ ]:
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
In [ ]:
def fit_and_evaluate(dataset, model, label, **kwargs):
    """
    Because of the Scikit-Learn API, we can create a function to
    do all of the fit and evaluate work on our behalf!
    """
    start = time.time()  # Start the clock!
    scores = {'precision': [], 'recall': [], 'accuracy': [], 'f1': []}

    for train, test in KFold(n_splits=12, shuffle=True).split(dataset.data):
        X_train, X_test = dataset.data[train], dataset.data[test]
        y_train, y_test = dataset.target[train], dataset.target[test]

        estimator = model(**kwargs)
        estimator.fit(X_train, y_train)

        expected = y_test
        predicted = estimator.predict(X_test)

        # Append our scores to the tracker
        scores['precision'].append(metrics.precision_score(expected, predicted, average="weighted"))
        scores['recall'].append(metrics.recall_score(expected, predicted, average="weighted"))
        scores['accuracy'].append(metrics.accuracy_score(expected, predicted))
        scores['f1'].append(metrics.f1_score(expected, predicted, average="weighted"))

    # Report
    print("Build and Validation of {} took {:0.3f} seconds".format(label, time.time() - start))
    print("Validation scores are as follows:\n")
    print(pd.DataFrame(scores).mean())

    # Write official estimator to disk
    estimator = model(**kwargs)
    estimator.fit(dataset.data, dataset.target)

    outpath = label.lower().replace(" ", "-") + ".pickle"
    with open(outpath, 'wb') as f:
        pickle.dump(estimator, f)

    print("\nFitted model written to:\n{}".format(os.path.abspath(outpath)))
In [ ]:
# Perform SVC Classification
fit_and_evaluate(dataset, SVC, "Wheat SVM Classifier")
In [ ]:
# Perform kNN Classification
fit_and_evaluate(dataset, KNeighborsClassifier, "Wheat kNN Classifier", n_neighbors=12)
In [ ]:
# Perform Random Forest Classification
fit_and_evaluate(dataset, RandomForestClassifier, "Wheat Random Forest Classifier")
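Each call above writes the fitted estimator to disk as a pickle. As a sketch of how such a model could be loaded and used later (the filename follows the naming convention in fit_and_evaluate; the sample feature values are invented purely for illustration):
In [ ]:
# Load a previously fitted estimator and classify a new observation.
# The filename matches the label passed to fit_and_evaluate above;
# the feature values here are made up for illustration only.
with open('wheat-random-forest-classifier.pickle', 'rb') as f:
    estimator = pickle.load(f)

sample = np.array([[14.9, 14.6, 0.88, 5.5, 3.3, 2.2, 5.1]])
prediction = estimator.predict(sample)
print(LABEL_MAP[int(prediction[0])])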