Classifying Wheat Kernels by Physical Property

In last week's workshop, you selected a data set from the UCI Machine Learning Repository and, based on the recommended analysis type, wrangled the data into a fitted model and reported some model evaluation. In particular:

  • Lay out the data into a dataset X and targets y.
  • Choose regression, classification, or clustering and build the best model you can from it.
  • Report an evaluation of the model built.
  • Visualize aspects of your model (optional).
  • Compare and contrast different model families.

I will review your code when it is complete, so please submit it via pull request to the Introduction to Machine Learning with Scikit-Learn repository!

Wheat Kernel Example

This dataset was downloaded from the UCI Machine Learning Repository on February 26, 2015. The first step is to fully describe your data in a README file. The dataset description is as follows:

  • Data Set: Multivariate
  • Attribute: Real
  • Tasks: Classification, Clustering
  • Instances: 210
  • Attributes: 7

Data Set Information:

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.

Attribute Information:

To construct the data, seven geometric parameters of wheat kernels were measured:

  1. area A,
  2. perimeter P,
  3. compactness C = 4*pi*A / P^2,
  4. length of kernel,
  5. width of kernel,
  6. asymmetry coefficient,
  7. length of kernel groove.

All of these parameters are real-valued and continuous.
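
The compactness formula can be checked numerically. The sketch below is illustrative only; the function name `compactness` is an assumption and not part of the dataset's own tooling. A perfect circle gives C = 1, and less circular outlines give smaller values.

```python
import math

def compactness(area, perimeter):
    """Compactness C = 4*pi*A / P^2 (hypothetical helper name)."""
    return 4 * math.pi * area / perimeter ** 2

# A circle of radius 1 has area pi and perimeter 2*pi, so C is 1
circle = compactness(math.pi, 2 * math.pi)

# A unit square has area 1 and perimeter 4, so C is pi/4
square = compactness(1.0, 4.0)

print(round(circle, 4), round(square, 4))
```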

Relevant Papers:

M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, 'A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images', in: Information Technologies in Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24.

Data Exploration

In this section we will begin to explore the dataset to determine relevant information.


In [1]:
%matplotlib notebook

import os
import json
import time
import pickle
import requests

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt"

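def fetch_data(fname='seeds_dataset.txt'):
    """
    Helper method to retrieve the ML Repository dataset.
    Skips the download if the file already exists on disk.
    """
    outpath = os.path.abspath(fname)
    if not os.path.exists(outpath):
        response = requests.get(URL)
        response.raise_for_status()  # Fail loudly on a bad HTTP status
        with open(outpath, 'wb') as f:
            f.write(response.content)

    return outpath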

# Fetch the data if required
DATA = fetch_data()

In [3]:
FEATURES  = [
    "area",
    "perimeter",
    "compactness",
    "length",
    "width",
    "asymmetry",
    "groove",
    "label"
]

LABEL_MAP = {
    1: "Kama",
    2: "Rosa",
    3: "Canadian",
}

# Read the data into a DataFrame
df = pd.read_csv(DATA, sep=r'\s+', header=None, names=FEATURES)

# Convert class labels into text
df["label"] = df["label"].map(LABEL_MAP)

# Describe the dataset
print(df.describe())


             area   perimeter  compactness      length       width  \
count  210.000000  210.000000   210.000000  210.000000  210.000000   
mean    14.847524   14.559286     0.870999    5.628533    3.258605   
std      2.909699    1.305959     0.023629    0.443063    0.377714   
min     10.590000   12.410000     0.808100    4.899000    2.630000   
25%     12.270000   13.450000     0.856900    5.262250    2.944000   
50%     14.355000   14.320000     0.873450    5.523500    3.237000   
75%     17.305000   15.715000     0.887775    5.979750    3.561750   
max     21.180000   17.250000     0.918300    6.675000    4.033000   

        asymmetry      groove  
count  210.000000  210.000000  
mean     3.700201    5.408071  
std      1.503557    0.491480  
min      0.765100    4.519000  
25%      2.561500    5.045000  
50%      3.599000    5.223000  
75%      4.768750    5.877000  
max      8.456000    6.550000  

In [4]:
# Determine the shape of the data (the 8 columns are 7 features plus the label)
print("{} instances with {} features\n".format(*df.shape))

# Determine the frequency of each class
print(df.groupby('label')['label'].count())


210 instances with 8 features

label
Canadian    70
Kama        70
Rosa        70
Name: label, dtype: int64

In [5]:
from sklearn.preprocessing import LabelEncoder

# Extract our X and y data
X = df[FEATURES[:-1]]
y = df["label"]

# Encode our target variable
encoder = LabelEncoder().fit(y)
y = encoder.transform(y)

print(X.shape, y.shape)


(210, 7) (210,)
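
LabelEncoder assigns integer codes in sorted order of the class names, so the codes do not match the original 1, 2, 3 labels from the file. A minimal sketch, using a small hand-written label list rather than the full DataFrame, shows how to inspect and reverse the mapping:

```python
from sklearn.preprocessing import LabelEncoder

# A few hand-written labels standing in for df["label"]
labels = ["Kama", "Rosa", "Canadian", "Kama"]

encoder = LabelEncoder().fit(labels)
codes = encoder.transform(labels)

# classes_ holds the sorted class names; a code is an index into it
print(list(encoder.classes_))
print(list(codes))

# inverse_transform recovers the text labels from the integer codes
print(list(encoder.inverse_transform(codes)))
```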

In [6]:
# Create a scatter matrix of the dataframe features
from pandas.plotting import scatter_matrix
scatter_matrix(X, alpha=0.2, figsize=(8, 8), diagonal='kde')
plt.show()



In [9]:
from yellowbrick.features import ParallelCoordinates

oz = ParallelCoordinates(classes=encoder.classes_, normalize='standard').fit(X, y)
_ = oz.show()



In [10]:
from yellowbrick.features import RadViz

oz = RadViz(classes=encoder.classes_, alpha=0.35).fit(X, y)
_ = oz.show()


Data Extraction

One way that we can structure our data for easy management is to save files on disk. The Scikit-Learn datasets are already structured this way, and when loaded into a Bunch (a dictionary-like container class provided by Scikit-Learn) we can expose a data API that is very similar to the one we've used to train on our toy datasets in the past. A Bunch object exposes some important properties:

  • data: array of shape n_samples * n_features
  • target: array of length n_samples
  • feature_names: names of the features
  • target_names: names of the targets
  • filenames: names of the files that were loaded
  • DESCR: contents of the readme

Note: This does not preclude database storage of the data. In fact, a database can easily be extended to load the same Bunch API: simply store the README and feature names in a dataset description table and load them from there. The filenames property will be redundant, but you could store the SQL statement that performs the data load instead.
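
As a rough sketch of the database idea, the code below loads a Bunch from an in-memory SQLite database. Everything about the schema is an assumption made for illustration: the table names `description` and `instances`, the column names, the comma-separated storage of names, the `load_sql` property, and the two made-up rows.

```python
import sqlite3

import numpy as np
from sklearn.utils import Bunch

def load_bunch_from_db(conn):
    """Build a Bunch from the hypothetical description/instances tables."""
    descr, features, targets, query = conn.execute(
        "SELECT readme, feature_names, target_names, load_sql FROM description"
    ).fetchone()

    # Run the stored SQL statement to fetch the instances
    rows = np.array(conn.execute(query).fetchall(), dtype=float)

    return Bunch(
        data=rows[:, :-1],
        target=rows[:, -1].astype(int),
        feature_names=features.split(","),
        target_names=targets.split(","),
        load_sql=query,  # stands in for the redundant filenames property
        DESCR=descr,
    )

# Demonstration with made-up values, not rows from the wheat dataset
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE description "
    "(readme TEXT, feature_names TEXT, target_names TEXT, load_sql TEXT)"
)
conn.execute("CREATE TABLE instances (area REAL, perimeter REAL, label INTEGER)")
conn.execute(
    "INSERT INTO description VALUES (?, ?, ?, ?)",
    (
        "Toy wheat table",
        "area,perimeter",
        "Kama,Rosa,Canadian",
        "SELECT area, perimeter, label FROM instances",
    ),
)
conn.executemany(
    "INSERT INTO instances VALUES (?, ?, ?)",
    [(15.0, 14.5, 1), (18.0, 16.0, 2)],
)

bunch = load_bunch_from_db(conn)
print(bunch.data.shape, bunch.target)
```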

IMPORTANT: For the code below to work, you need to unzip wheat.zip in the data folder at the top level of the repository.

In order to manage our data set on disk, we'll structure our data as follows:


In [13]:
from sklearn.utils import Bunch

DATA_DIR = os.path.abspath(os.path.join("..", "data", "wheat"))

# Show the contents of the data directory
for name in os.listdir(DATA_DIR):
    if name.startswith("."): continue
    print("- {}".format(name))


- wrangle.py
- README.md
- dataset.csv
- seeds_dataset.txt
- meta.json

In [14]:
def load_data(root=DATA_DIR):
    # Construct the `Bunch` for the wheat dataset
    filenames     = {
        'meta': os.path.join(root, 'meta.json'),
        'rdme': os.path.join(root, 'README.md'),
        'data': os.path.join(root, 'seeds_dataset.txt'),
    }

    # Load the meta data from the meta json
    with open(filenames['meta'], 'r') as f:
        meta = json.load(f)
        target_names  = meta['target_names']
        feature_names = meta['feature_names']

    # Load the description from the README. 
    with open(filenames['rdme'], 'r') as f:
        DESCR = f.read()

    # Load the dataset from the text file.
    dataset = np.loadtxt(filenames['data'])

    # Extract the target from the data
    data   = dataset[:, 0:-1]
    target = dataset[:, -1]

    # Create the bunch object
    return Bunch(
        data=data,
        target=target,
        filenames=filenames,
        target_names=target_names,
        feature_names=feature_names,
        DESCR=DESCR
    )

# Save the dataset as a variable we can use.
dataset = load_data()

print(dataset.data.shape)
print(dataset.target.shape)


(210, 7)
(210,)
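
load_data expects meta.json to provide `target_names` and `feature_names`. The actual file in the repository is not shown here, so the following is only a sketch of what such a file could contain, built from the feature list and label map used earlier and written to a temporary directory. Because the labels in seeds_dataset.txt are 1-based, a target value t would correspond to target_names[t - 1] under this layout.

```python
import json
import os
import tempfile

# Assumed contents; the repository's real meta.json may differ
meta = {
    "feature_names": [
        "area", "perimeter", "compactness", "length",
        "width", "asymmetry", "groove",
    ],
    "target_names": ["Kama", "Rosa", "Canadian"],
}

# Write to a temporary directory so nothing in the repository is touched
root = tempfile.mkdtemp()
path = os.path.join(root, "meta.json")
with open(path, "w") as f:
    json.dump(meta, f, indent=2)

# Read it back the same way load_data does
with open(path, "r") as f:
    loaded = json.load(f)

# Labels in the data file are 1, 2, 3, so subtract one to index the names
print(loaded["target_names"][1 - 1])
```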

Classification

Now that we have a dataset Bunch loaded and ready, we can begin the classification process. We first look at the attributes that a fitted estimator exposes, then build and compare kNN, SVM, and Random Forest classifiers.


In [15]:
def get_internal_params(model):
    """Print learned attributes, which by convention end with an underscore."""
    for attr in dir(model):
        if attr.endswith("_") and not attr.startswith("_"):
            print(attr, getattr(model, attr))

In [16]:
from sklearn.tree import DecisionTreeClassifier

In [17]:
model = DecisionTreeClassifier()
get_internal_params(model)


---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
<ipython-input-17-a32a78525c8a> in <module>
      1 model = DecisionTreeClassifier()
----> 2 get_internal_params(model)

<ipython-input-15-e1dedf3bd5f6> in get_internal_params(model)
      2     for attr in dir(model):
      3         if attr.endswith("_") and not attr.startswith("_"):
----> 4             print(attr, getattr(model, attr))

~/.pyenv/versions/3.7.3/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/tree/_classes.py in feature_importances_(self)
    574             (Gini importance).
    575         """
--> 576         check_is_fitted(self)
    577 
    578         return self.tree_.compute_feature_importances()

~/.pyenv/versions/3.7.3/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
    965 
    966     if not attrs:
--> 967         raise NotFittedError(msg % {'name': type(estimator).__name__})
    968 
    969 

NotFittedError: This DecisionTreeClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
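
The error is raised because feature_importances_ is a property that checks whether the estimator has been fitted. One way to guard against it, sketched here with a hypothetical helper `is_fitted` and a tiny made-up training set, is to call check_is_fitted first:

```python
from sklearn.exceptions import NotFittedError
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import check_is_fitted

def is_fitted(model):
    """Return True if the estimator has been fitted (hypothetical helper)."""
    try:
        check_is_fitted(model)
    except NotFittedError:
        return False
    return True

tree = DecisionTreeClassifier()
before = is_fitted(tree)

# Four made-up points, just enough to fit a tree
tree.fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
after = is_fitted(tree)

print(before, after)
```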

In [18]:
model.fit(X,y)
get_internal_params(model)


classes_ [0 1 2]
feature_importances_ [0.35574451 0.00714286 0.01760204 0.01190476 0.02042607 0.05540911
 0.53177066]
max_features_ 7
n_classes_ 3
n_features_ 7
n_outputs_ 1
tree_ <sklearn.tree._tree.Tree object at 0x12c52e988>

In [19]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X, y)
get_internal_params(model)


classes_ [0 1 2]
coef_ [[-2.09372944  0.95962048  0.30106853  0.13872252  0.47104081  0.49850611
   1.69398472]
 [ 0.04772051  0.58676923  0.12323364  1.29662165  0.24189993 -0.55883534
  -2.85286682]
 [ 2.04600893 -1.54638972 -0.42430217 -1.43534416 -0.71294073  0.06032923
   1.1588821 ]]
intercept_ [ 2.1874024  1.1743927 -3.3617951]
n_iter_ [100]
/Users/benjamin/.pyenv/versions/3.7.3/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
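
The ConvergenceWarning suggests either raising max_iter or scaling the data. A minimal sketch of the scaling option follows. It uses synthetic data from make_classification with the same shape as the wheat data (210 rows, 7 features, 3 classes) so that it runs on its own; the column rescaling and the max_iter value of 1000 are arbitrary assumptions, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wheat data
X_demo, y_demo = make_classification(
    n_samples=210, n_features=7, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
# Put the columns on different scales, as raw measurements often are
X_demo = X_demo * np.array([10.0, 10.0, 0.01, 1.0, 1.0, 1.0, 1.0])

# Standardize each feature, then fit the logistic regression
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# Number of solver iterations used, and accuracy on the training data
print(scaled_model.named_steps["logisticregression"].n_iter_)
print(scaled_model.score(X_demo, y_demo))
```

Wrapping the scaler and the classifier in one pipeline means the scaling parameters are learned only from whatever data is passed to fit, which matters once cross-validation is involved.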

In [20]:
from sklearn import metrics

from sklearn.model_selection import KFold

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [21]:
def fit_and_evaluate(X, y, model, label, **kwargs):
    """
    Because of the Scikit-Learn API, we can create a function to
    do all of the fit and evaluate work on our behalf!
    """
    start  = time.time() # Start the clock! 
    scores = {'precision':[], 'recall':[], 'accuracy':[], 'f1':[]}
    
    kf = KFold(n_splits = 12, shuffle=True)
    
    for train, test in kf.split(X, y):
        X_train, X_test = X.iloc[train], X.iloc[test]
        y_train, y_test = y[train], y[test]
        
        estimator = model(**kwargs) 
        estimator.fit(X_train, y_train)
        
        expected  = y_test
        predicted = estimator.predict(X_test)
        
        # Append our scores to the tracker
        scores['precision'].append(metrics.precision_score(expected, predicted, average="weighted"))
        scores['recall'].append(metrics.recall_score(expected, predicted, average="weighted"))
        scores['accuracy'].append(metrics.accuracy_score(expected, predicted))
        scores['f1'].append(metrics.f1_score(expected, predicted, average="weighted"))

    # Report
    print("Build and Validation of {} took {:0.3f} seconds".format(label, time.time()-start))
    print("Validation scores are as follows:\n")
    print(pd.DataFrame(scores).mean())
    
    # Write official estimator to disk
    estimator = model(**kwargs)
    estimator.fit(X, y)
    
    outpath = label.lower().replace(" ", "-") + ".pickle"
    with open(outpath, 'wb') as f:
        pickle.dump(estimator, f)

    print("\nFitted model written to:\n{}".format(os.path.abspath(outpath)))
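
Scikit-Learn also ships a helper, cross_validate, that covers the scoring loop in fit_and_evaluate. The sketch below is an alternative, not the function used in this notebook: it swaps KFold for StratifiedKFold so every fold keeps roughly the same class balance, fixes random_state for repeatability, and runs on synthetic data so that it stands alone.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with the same shape as the wheat data
X_demo, y_demo = make_classification(
    n_samples=210, n_features=7, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

cv = StratifiedKFold(n_splits=12, shuffle=True, random_state=42)
scoring = ["precision_weighted", "recall_weighted", "accuracy", "f1_weighted"]

# One call fits and scores the estimator on every fold
results = cross_validate(
    KNeighborsClassifier(n_neighbors=12), X_demo, y_demo,
    cv=cv, scoring=scoring,
)

# Per-fold test scores are stored under "test_<metric name>"
mean_scores = {name: results["test_" + name].mean() for name in scoring}
for name, value in mean_scores.items():
    print("{:<20} {:0.6f}".format(name, value))
```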

In [22]:
# Perform SVC Classification

fit_and_evaluate(X, y, SVC, "Wheat SVM Classifier", gamma = 'auto')


Build and Validation of Wheat SVM Classifier took 0.065 seconds
Validation scores are as follows:

precision    0.918864
recall       0.904684
accuracy     0.904684
f1           0.903441
dtype: float64

Fitted model written to:
/Users/benjamin/Workspace/georgetown/machine-learning/notebooks/wheat-svm-classifier.pickle

In [23]:
# Perform kNN Classification
fit_and_evaluate(X, y, KNeighborsClassifier, "Wheat kNN Classifier", n_neighbors=12)


Build and Validation of Wheat kNN Classifier took 0.071 seconds
Validation scores are as follows:

precision    0.923123
recall       0.909586
accuracy     0.909586
f1           0.909621
dtype: float64

Fitted model written to:
/Users/benjamin/Workspace/georgetown/machine-learning/notebooks/wheat-knn-classifier.pickle

In [24]:
# Perform Random Forest Classification
fit_and_evaluate(X, y, RandomForestClassifier, "Wheat Random Forest Classifier")


Build and Validation of Wheat Random Forest Classifier took 1.535 seconds
Validation scores are as follows:

precision    0.953871
recall       0.943627
accuracy     0.943627
f1           0.943543
dtype: float64

Fitted model written to:
/Users/benjamin/Workspace/georgetown/machine-learning/notebooks/wheat-random-forest-classifier.pickle
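
Each pickle can later be loaded back and used for prediction without refitting. A minimal round-trip sketch follows, using a temporary file and a tiny made-up training set rather than the wheat models written above. Pickles should only be loaded from trusted sources, and ideally with the same Scikit-Learn version that wrote them.

```python
import os
import pickle
import tempfile

from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up training set: one feature, two classes
X_toy = [[0.0], [0.1], [0.9], [1.0]]
y_toy = [0, 0, 1, 1]
fitted = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy)

# Write the fitted estimator to disk, as fit_and_evaluate does
path = os.path.join(tempfile.mkdtemp(), "toy-knn-classifier.pickle")
with open(path, "wb") as f:
    pickle.dump(fitted, f)

# Load it back and predict with the restored estimator
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[0.05], [0.95]]))
```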