Can we predict how well a wine will be received based on its chemical makeup?

Kind of a neat question that Manu Jeevan did a write-up on.

We work with unbalanced classes, SVMs, and random forests in this example.

Imports


In [56]:
%matplotlib inline
import os
import numpy as np
import pandas as pd
import scipy as sp
import sklearn
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import RandomForestClassifier

Load data


In [57]:
dataDir = os.path.join(os.path.expanduser('~'),'data','ml','winequality')
wine_df = pd.read_csv(os.path.join(dataDir,'winequality-red.csv'), sep=';')
wine_df.head()


Out[57]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5

Create the matrix and simplify the classification space


In [58]:
Y = wine_df.quality.values
wine_df = wine_df.drop('quality',axis=1)
print(Y[:10])

Y = np.asarray([1 if i>=7 else 0 for i in Y])
X = wine_df.values
print(X.shape)
print(Y[:10])


[5 5 5 6 5 5 5 7 7 5]
(1599, 11)
[0 0 0 0 0 0 0 1 1 0]

Random Forest


In [61]:
scores = []
for val in range(1,21):
    clf = RandomForestClassifier(n_estimators=val)
    validated = cross_val_score(clf,X,Y,cv=10)
    scores.append(validated)

#print len(scores)
fig = plt.figure()
plt.clf()
ax = fig.add_subplot(111)
ax.boxplot(scores)
ax.set_ylim((0,1))
ax.set_xlim((0,21))
#sns.boxplot(scores)
plt.xlabel("number trees")
plt.ylabel("classification scores")
plt.title("classification score per number of trees")
plt.show()


Unbalanced design

Classification accuracy can be misleading when the classes are unbalanced, so we use F1 scores instead.
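As a quick illustration, a dummy "classifier" that always predicts the majority class (not a good wine) gets high accuracy on data this skewed while never identifying a single good wine; the F1 score exposes that immediately:

# Quick illustration: accuracy vs. F1 for an always-predict-the-majority baseline.
from sklearn.metrics import accuracy_score, f1_score

y_majority = np.zeros_like(Y)   # always predict class 0 (not a good wine)
print("accuracy: %.3f" % accuracy_score(Y, y_majority))  # high, because most wines are class 0
print("f1:       %.3f" % f1_score(Y, y_majority))        # 0.0, because no good wine is ever found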


In [60]:
scores = []
for val in range(1,21):
    clf = RandomForestClassifier(n_estimators=val)
    validated = cross_val_score(clf,X,Y,cv=10,scoring='f1')
    scores.append(validated)

fig = plt.figure()
plt.clf()
ax = fig.add_subplot(111)
ax.boxplot(scores)
ax.set_ylim((0,1))
ax.set_xlim((0,21))
plt.xlabel("number trees")
plt.ylabel("classification scores")
plt.title("classification score per number of trees")
plt.show()



In short, we don't see much gain from increasing the number of trees. The predict_proba function returns a probability for each class, and by default an observation is called positive when this probability is > 0.5. For many classifiers these probabilities are poorly calibrated when the class structure is highly unbalanced, so rather than trusting the default 0.5 cutoff we can use cross-validation to find a better one.


In [80]:
print("total normals: %s/%s"%(np.where(Y==0)[0].size,Y.size))


total normals: 1382/1599

In [81]:
def cutoff_predict(clf,X,cutoff):
    return (clf.predict_proba(X)[:,1] > cutoff).astype(int)

scores = []
def custom_f1(cutoff):
    def f1_cutoff(clf,X,Y):
        ypred = cutoff_predict(clf,X,cutoff)
        return sklearn.metrics.f1_score(Y,ypred)
    
    return f1_cutoff

parmRange = np.arange(0.1,0.9,0.1)
for cutoff in parmRange:
    clf = RandomForestClassifier(n_estimators=15)
    validated = cross_val_score(clf,X,Y,cv=10,scoring=custom_f1(cutoff))
    scores.append(validated)
    
fig = plt.figure()
plt.clf()
ax = fig.add_subplot(111)
ax.boxplot(scores)
ax.set_ylim((0,1))
ax.set_xticklabels(np.round(parmRange, 1))
plt.xlabel("cutoff value")
plt.ylabel("custom f1-score")
plt.title("fscores for each tree")
plt.show()


It makes intuitive sense that the best cutoff is below 0.5: the training data contains far fewer examples of 'good' wines, so the cutoff has to be lowered to reflect how rare they are.
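To actually use a tuned cutoff at prediction time, a minimal sketch (reusing the cutoff_predict helper above; the 0.3 value is only a placeholder for whichever cutoff the boxplots favour) looks like this:

# Sketch: apply a tuned cutoff on a held-out split.
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=15)
clf.fit(X_train, Y_train)

best_cutoff = 0.3   # placeholder -- use whichever value maximised the cross-validated F1 above
Y_pred = cutoff_predict(clf, X_test, best_cutoff)
print(classification_report(Y_test, Y_pred))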

Plotting decision boundaries

Random forests let you compute a heuristic for how important each feature is in predicting the target: permute a feature's values and measure how much the accuracy drops; the bigger the drop, the more important the feature. (The feature_importances_ attribute used below is scikit-learn's impurity-based variant of this idea.)
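A hand-rolled version of the permutation heuristic might look roughly like this (evaluated on the training data for brevity; a held-out set would be more honest):

# Rough sketch of permutation importance: shuffle one column at a time
# and see how much the accuracy drops relative to the unshuffled baseline.
rng = np.random.RandomState(0)
clf = RandomForestClassifier(n_estimators=15)
clf.fit(X, Y)
baseline = clf.score(X, Y)

for i, name in enumerate(wine_df.columns):
    X_perm = X.copy()
    rng.shuffle(X_perm[:, i])   # break the link between this feature and the target
    print("%-22s accuracy drop: %.3f" % (name, baseline - clf.score(X_perm, Y)))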


In [85]:
clf = RandomForestClassifier(n_estimators=15)
clf.fit(X,Y)
imp = clf.feature_importances_
names = wine_df.columns
imp,names = zip(*sorted(zip(imp,names)))
fig = plt.figure()
plt.clf()
ax = fig.add_subplot(111)
print(np.array(imp).sum())  # sanity check: importances sum to 1
ax.barh(range(len(names)),imp,align='center')
plt.yticks(range(len(names)),names)
plt.xlabel("Importance of features")
plt.ylabel("Features")
plt.title("Importance of each feature")
plt.show()


1.0
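The intro mentioned SVMs as well; as a rough point of comparison, an RBF-kernel SVC with balanced class weights could be cross-validated the same way (the kernel and weighting choices here are just reasonable defaults, not tuned values):

# Rough SVM baseline for comparison. SVMs are scale-sensitive, so standardise first;
# class_weight='balanced' compensates for the skewed classes.
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
svm_scores = cross_val_score(SVC(kernel='rbf', class_weight='balanced'), X_scaled, Y, cv=10, scoring='f1')
print("SVM F1: %.3f +/- %.3f" % (svm_scores.mean(), svm_scores.std()))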

In [ ]: