*BLUF: Using the UCI red wine quality dataset, this notebook explores the data visually, then fits and cross-validates three classifiers (SVM, kNN, and random forest) to predict expert quality ratings from physicochemical measurements.*
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Two datasets were created using red and white wine samples. **I only test the red samples in this notebook.**
Inputs include objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
In the original ML analysis, several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD (mean absolute deviation), a confusion matrix for a fixed error tolerance (T), etc.
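As a hedged illustration (not the original paper's code), accuracy under a fixed error tolerance T can be computed by counting predictions that land within T points of the true score; the T=0.5 threshold below is an assumption for demonstration only.

In [ ]:
import numpy as np

def tolerance_accuracy(y_true, y_pred, T=0.5):
    # Fraction of predictions within T quality points of the true score
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) <= T)

# A regression prediction of 5.2 for a true rating of 5 counts as correct
# at T=0.5 but would not at T=0.1
print(tolerance_accuracy([5, 6, 7], [5.2, 6.8, 7.4], T=0.5))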
The original analysis also plotted the relative importance of the input variables (as measured by a sensitivity analysis procedure).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines.
Also, we are not sure that all input variables are relevant, so it could be interesting to test feature selection methods.
Number of Instances: red wine - 1599; white wine - 4898.
Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, so it makes sense to apply some form of feature selection (a quick correlation check is sketched below).
Missing Attribute Values: None
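As a hedged sketch of the correlation check mentioned above (this cell is an addition, and assumes the raw winequality-red.csv is in the working directory once the download cell below has run):

In [ ]:
import pandas as pd

# Pairwise correlations between all physicochemical variables and quality
wine = pd.read_csv('winequality-red.csv', sep=';')
corr = wine.corr()

# Report pairs whose absolute correlation exceeds 0.5; the threshold is an
# arbitrary choice for illustration
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > 0.5:
            print("{} / {}: {:.2f}".format(a, b, corr.loc[a, b]))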
In [31]:
%matplotlib inline
import os
import json
import time
import pickle
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [32]:
URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"

def fetch_data(fname='winequality-red.csv'):
    # Download the red wine csv from the UCI repository to disk
    response = requests.get(URL)
    response.raise_for_status()
    outpath = os.path.abspath(fname)
    # Write in binary mode, since response.content is a bytes object
    with open(outpath, 'wb') as f:
        f.write(response.content)
    return outpath

DATA = fetch_data()
In [33]:
FEATURES = [
"fixed acidity",
"volatile acidity",
"citric acid",
"residual sugar",
"chlorides",
"free sulfur dioxide",
"total sulfur dioxide",
"density",
"pH",
"sulphates",
"alcohol",
"quality"
]
# Label the wine quality rating. This is an identity mapping as written,
# but it provides a hook for re-binning the ratings later if desired.
LABEL_MAP = {k: k for k in range(1, 11)}
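Since LABEL_MAP is an identity mapping, the ratings pass through unchanged. For comparison, a coarser, purely hypothetical binning (not used anywhere below) might look like this:

In [ ]:
# Hypothetical three-way binning of the 1-10 scale; illustration only
COARSE_LABEL_MAP = {
    k: ('low' if k <= 4 else 'medium' if k <= 6 else 'high')
    for k in range(1, 11)
}
print(COARSE_LABEL_MAP)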
In [34]:
df = pd.read_csv(DATA, sep=';', header=0, names=FEATURES)

# Add the labels to the data (df.loc replaces the long-deprecated df.ix)
for k, v in LABEL_MAP.items():
    df.loc[df.quality == k, 'label'] = v

print(df.head())
In [35]:
print(df.describe())
In [36]:
# Determine the shape of the data
print("{} instances with {} columns\n".format(*df.shape))

# Determine the frequency of each class
print(df.groupby('label')['label'].count())

# Most red wine is rated a 5 or 6 out of 10
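To make that imbalance visible, a quick bar chart of the rating counts (an added sketch using the df loaded above):

In [ ]:
# Bar chart of rating frequencies; the 5s and 6s dominate
df['label'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('quality rating')
plt.ylabel('count')
plt.show()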
In [53]:
# Check for empty/missing values in the csv that would cause errors
# when we run the scikit-learn tools
df.isnull().values.any()
In [38]:
# Just for fun, read in the white wine csv
df2 = pd.read_csv('../data/wine-quality/winequality-white.csv', sep=';', header=0, names=FEATURES)
df2.head()
In [39]:
df2.isnull().values.any()
In [40]:
from pandas.plotting import scatter_matrix

scatter_matrix(df, alpha=0.2, figsize=(12, 12), diagonal='kde')
plt.show()
In [41]:
from pandas.plotting import parallel_coordinates

plt.figure(figsize=(12, 12))
parallel_coordinates(df, 'label')
plt.show()
In [42]:
from pandas.plotting import radviz

plt.figure(figsize=(12, 12))
radviz(df, 'label')
plt.show()
In [43]:
# Write the labeled frame back to disk. Note that to_csv also writes the
# row index as an extra first column, which load_data() below skips.
df.to_csv('../data/wine-quality/winequality-red.csv')
In [44]:
from sklearn.utils import Bunch

DATA_DIR = os.path.abspath(os.path.join(".", "..", "data", "wine-quality"))

for name in os.listdir(DATA_DIR):
    if name.startswith("."):
        continue
    print("- {}".format(name))
In [45]:
def load_data(root=DATA_DIR):
    filenames = {
        'data_red': os.path.join(root, 'winequality-red.csv'),
        'data_white': os.path.join(root, 'winequality-white.csv'),
        'data_red4': os.path.join(root, 'winequality-red4.csv'),
    }

    dataset = np.loadtxt(filenames['data_red'], delimiter=',', skiprows=1)

    # Extract the target from the data. Column 0 is the row index written
    # by to_csv, the last column is the label, and the second-to-last is
    # quality; quality must be excluded from the features, otherwise the
    # target leaks into the inputs.
    data = dataset[:, 1:-2]
    target = dataset[:, -1]

    # Create the Bunch object
    return Bunch(
        data=data,
        target=target,
        filenames=filenames
    )

# Save the dataset as a variable we can use
dataset = load_data()
print(dataset.data)
print(dataset.target)
In [46]:
print(dataset.data.shape)
print(dataset.target.shape)
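A quick sanity check on the extraction (an added assertion; the expected shape assumes the 1599-row red wine file and 11 physicochemical features, with quality excluded):

In [ ]:
assert dataset.data.shape == (1599, 11)
assert dataset.target.shape == (1599,)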
In [47]:
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
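One caveat: SVC and KNeighborsClassifier are sensitive to feature scale, and the wine variables span very different ranges (total sulfur dioxide runs into the hundreds while density sits near 1). A hedged sketch of folding standardization into a pipeline (not applied in the cells below, which fit the raw features):

In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Standardize the features before the SVM; the score is illustrative only
scaled_svc = make_pipeline(StandardScaler(), SVC())
print(cross_val_score(scaled_svc, dataset.data, dataset.target, cv=12).mean())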
In [48]:
def fit_and_eval(dataset, model, label, **kwargs):
    start = time.time()  # start the clock
    scores = {'precision': [], 'recall': [], 'accuracy': [], 'f1': []}

    for train, test in KFold(n_splits=12, shuffle=True).split(dataset.data):
        X_train, X_test = dataset.data[train], dataset.data[test]
        y_train, y_test = dataset.target[train], dataset.target[test]

        estimator = model(**kwargs)
        estimator.fit(X_train, y_train)

        expected = y_test
        predicted = estimator.predict(X_test)

        # Append scores to the tracker
        scores['precision'].append(metrics.precision_score(expected, predicted, average='weighted'))
        scores['recall'].append(metrics.recall_score(expected, predicted, average='weighted'))
        scores['accuracy'].append(metrics.accuracy_score(expected, predicted))
        scores['f1'].append(metrics.f1_score(expected, predicted, average='weighted'))

    # Report
    print("Build and validation took {:0.3f} seconds.".format(time.time() - start))
    print("Validation scores are as follows:\n")
    print(pd.DataFrame(scores).mean())

    # Refit on the full dataset and write the official estimator to disk
    estimator = model(**kwargs)
    estimator.fit(dataset.data, dataset.target)

    outpath = label.lower().replace(" ", "-") + ".pickle"
    with open(outpath, 'wb') as f:
        pickle.dump(estimator, f)

    print("\nFitted model written to:\n{}".format(os.path.abspath(outpath)))
In [49]:
fit_and_eval(dataset, SVC, "Wine SVM Classifier")
In [50]:
fit_and_eval(dataset, KNeighborsClassifier, "Wine kNN Classifier", n_neighbors=12)
In [52]:
fit_and_eval(dataset, RandomForestClassifier, "Wine Random Forest Classifier")
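The random forest also offers a cheap analogue of the variable-importance plot mentioned in the dataset description. A minimal sketch, assuming the pickle written by the cell above and the feature ordering in FEATURES:

In [ ]:
# Load the pickled forest and rank features by impurity-based importance
with open('wine-random-forest-classifier.pickle', 'rb') as f:
    forest = pickle.load(f)

for name, score in sorted(zip(FEATURES[:-1], forest.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print("{:<22s} {:.3f}".format(name, score))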
In [56]:
# As an experiment: what happens when we delete the features that, based
# on the parallel coordinates plot, don't show a clear delineation?

# Read the csv into a dataframe
df4 = pd.read_csv(DATA, sep=';', header=0, names=FEATURES)

# Label based on the wine quality rating
for k, v in LABEL_MAP.items():
    df4.loc[df4.quality == k, 'label'] = v

# Delete the features that don't show a clear delineation among the ratings
df4 = df4.drop(['free sulfur dioxide', 'total sulfur dioxide', 'residual sugar'], axis=1)
df4.head()
In [57]:
plt.figure(figsize=(12,12))
parallel_coordinates(df4, 'label')
plt.show()
In [58]:
# Again, to_csv writes the row index as an extra first column, which
# load_data4() below skips.
df4.to_csv('../data/wine-quality/winequality-red4.csv')
In [60]:
def load_data4(root=DATA_DIR):
    # Same as load_data(), but reads the reduced-feature csv
    filenames = {
        'data_red': os.path.join(root, 'winequality-red.csv'),
        'data_white': os.path.join(root, 'winequality-white.csv'),
        'data_red4': os.path.join(root, 'winequality-red4.csv'),
    }

    dataset = np.loadtxt(filenames['data_red4'], delimiter=',', skiprows=1)

    # Extract the target from the data, again skipping the index column and
    # excluding quality (second-to-last) from the features
    data = dataset[:, 1:-2]
    target = dataset[:, -1]

    # Create the Bunch object
    return Bunch(
        data=data,
        target=target,
        filenames=filenames
    )

# Save the dataset as a variable we can use
dataset = load_data4()
print(dataset.data)
print(dataset.target)
In [61]:
print(dataset.data.shape)
print(dataset.target.shape)
In [66]:
fit_and_eval(dataset, SVC, "Wine SVM Classifier v4")
In [63]:
fit_and_eval(dataset, KNeighborsClassifier, "Wine kNN Classifier v4", n_neighbors=12)
In [64]:
fit_and_eval(dataset, RandomForestClassifier, "Wine Random Forest Classifier v4")