In this tutorial we will be analyzing a data set on wine quality taken from the UC Irvine Machine Learning Repository. The data consists of a chemical analysis of many types of wine, and each wine is given a quality score. You can read more about the data here. Some math concepts in this tutorial will not be covered in detail; please see the Math Primer in the GitHub repo for an excellent discussion of common math concepts in machine learning.
In [1]:
import numpy as np
import pandas as pd
Load the data using Pandas:
In [2]:
red_wine = pd.read_csv('winequality-red.csv',sep=';')
Let's take a look at the data. A good way to do this is with the info(), head(), and describe() functions in Pandas.
In [3]:
red_wine.head() # This command displays the column headings and first five rows of data.
Out[3]:
In [4]:
red_wine.describe().T # This command displays statistics about each column with numerical data.
Out[4]:
In [5]:
red_wine.info() # This command displays the column types and size for the data frame.
We will focus on the red wines for this study. Let's see how the red wines range in quality score...
To better understand the data, we will use a scatter plot matrix. This is a grid of plots comparing every feature in the data set against every other feature. All features are labeled on both the x-axis and the y-axis. If, for example, you want to compare citric acid and sulphates, you would find citric acid on the x-axis and then go up to sulphates on the y-axis to locate the plot that compares these two features. Where a feature is compared with itself you will instead see a histogram showing its distribution. It provides a good way to see how the different features are related to each other.
We will use the Python visualization library Seaborn to make these plots. Seaborn is a great tool for high-level statistical graphics.
In [6]:
import seaborn as sb
sb.set_context("notebook", font_scale=2.5)
from matplotlib import pyplot as plt
%matplotlib inline
In [7]:
sb.pairplot(red_wine, size=3)
Out[7]:
In [8]:
from IPython.display import Image
Image(filename='images/fig1.png')
Out[8]:
In [9]:
red_wine[['total sulfur dioxide', 'sulphates']].describe().T
Out[9]:
To handle the outliers I defined the function below. It takes as input a dataframe, a threshold defined as the number of standard deviations from the mean, and the columns you want to 'clean'. Rather than dropping outliers, it replaces them with the column mean (computed after the outliers are masked out).
In [10]:
def outliers(df, threshold, columns):
    # For each column, flag values more than `threshold` standard deviations
    # above the mean, then replace them with the mean of the remaining values.
    for col in columns:
        mask = df[col] > float(threshold) * df[col].std() + df[col].mean()
        df.loc[mask, col] = np.nan             # mask the outliers
        mean_property = df.loc[:, col].mean()  # mean computed without outliers
        df.loc[mask, col] = mean_property      # fill the outliers with the mean
    return df
In [11]:
column_list = red_wine.columns.tolist() # Save the column names for the wine dataframe to a list.
Below, I set the threshold to five. Any value more than five standard deviations above the mean will be labeled as an outlier.
In [12]:
threshold = 5
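To make the rule concrete, here is a quick way to inspect the cutoff for a single column and count how many values it flags (a small illustration, not part of the original analysis):

col = 'total sulfur dioxide'
cutoff = threshold * red_wine[col].std() + red_wine[col].mean()
print("Cutoff: {0:.1f}, values flagged: {1}".format(cutoff, (red_wine[col] > cutoff).sum()))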
In [13]:
red_wine_cleaned = red_wine.copy()
red_wine_cleaned = outliers(red_wine_cleaned, threshold, column_list[0:-1])
red_wine_cleaned.describe().T
Out[13]:
Now let's examine the data again with the outliers replaced.
In [14]:
sb.pairplot(red_wine_cleaned, size=3)
Out[14]:
Let's compare the two features total sulfur dioxide and sulphates again, with and without outliers, to see the difference.
In [15]:
sb.set_context("notebook", font_scale=1)
pp = sb.pairplot(red_wine[['total sulfur dioxide', 'sulphates']], size=3)
plt.subplots_adjust(top=0.9)
pp.fig.suptitle('With Outliers', fontsize=20, verticalalignment='top')
Out[15]:
In [16]:
sb.set_context("notebook", font_scale=1)
pp = sb.pairplot(red_wine_cleaned[['total sulfur dioxide', 'sulphates']], size=3)
plt.subplots_adjust(top=0.9)
pp.fig.suptitle('Without Outliers', fontsize=20, verticalalignment='top')
Out[16]:
Now we will bin the data to define categories. Our model will try to infer the category given the various chemical properties measured for the wine data set.
In [17]:
print("The range is wine quality is {0}".format(np.sort(red_wine_cleaned['quality'].unique())))
First, we will bin the data into three bins based on quality: 'Bad', 'Average', and 'Good'.
In [18]:
bins = [3, 5, 6, 8]
red_wine_cleaned['category'] = pd.cut(red_wine_cleaned.quality, bins, labels=['Bad', 'Average', 'Good'])
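Note that pd.cut uses intervals that are open on the left and closed on the right, so with bins = [3, 5, 6, 8] the scores map as (3, 5] -> 'Bad', (5, 6] -> 'Average', (6, 8] -> 'Good'; a quality of exactly 3 falls outside the first interval and gets a missing category. A quick check (an illustration, not part of the original notebook):

print(pd.cut([3, 4, 5, 6, 7, 8], bins, labels=['Bad', 'Average', 'Good']))
# 3 -> NaN, 4 -> Bad, 5 -> Bad, 6 -> Average, 7 -> Good, 8 -> Good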
In [19]:
sb.pairplot(red_wine_cleaned.drop(['quality'],1),hue='category', size=3)
Out[19]:
We can examine some metrics by category using the Pandas routines groupby() and agg(). I won't discuss these routines in the tutorial, but if you want to learn more, please read the Pandas documentation.
In [20]:
red_wine_cleaned.drop('quality',1).groupby('category').agg(['mean','std']).T
Out[20]:
Notice that there is quite a bit of overlap between the average values and the bad values. To improve the model fitting, we will throw out the average values and only perform a classification between the 'Good' wine and the 'Bad' wine.
In [21]:
red_wine_newcats = red_wine_cleaned[red_wine_cleaned['category'].isin(['Bad','Good'])].copy()
In [22]:
np.sort(red_wine_newcats['quality'].unique())
Out[22]:
In [23]:
bins = [3, 5, 8]
red_wine_newcats['category'] = pd.cut(red_wine_newcats.quality, bins, labels=['Bad', 'Good'])
In [24]:
red_wine.shape, red_wine_newcats.shape
Out[24]:
In [25]:
sb.pairplot(red_wine_newcats.drop(['quality'],1),hue='category', size=3)
Out[25]:
In [26]:
red_wine_newcats.drop('quality',1).groupby('category').agg(['mean','std']).T
Out[26]:
Examine the 'cleaned' pairplot above. Can you identify any features that appear related to each other? Hint: think back to high school chemistry class. Features that have a linear dependency are called collinear and can be problematic if they are included in modeling. See the Math Primer for a discussion of collinearity.
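One quick way to check for collinearity numerically is a correlation matrix; strongly correlated pairs (pH and the acidity features, for example, tend to move together) are candidates for removal. A minimal sketch, not part of the original notebook:

# Pairwise Pearson correlations between the chemical features.
corr = red_wine_newcats.drop(['quality', 'category'], axis=1).corr()
print(corr.round(2))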
Before using TensorFlow directly, we will use skflow to make the model. Skflow is a Python library that wraps many of the TensorFlow commands in routines that are more like scikit-learn. Therefore, if you are more familiar with scikit-learn, Skflow can be a gentle introduction to TensorFlow.
In [27]:
import sklearn
from sklearn import metrics, preprocessing
from sklearn.cross_validation import train_test_split
import skflow
It looks like total sulfur dioxide is a good indicator of wine quality. Let's try using this feature to classify whether a wine is 'Good' or 'Bad'.
In [28]:
y_red_wine = red_wine_newcats[['category']].get_values()
In [29]:
X_red_wine = red_wine_newcats['total sulfur dioxide'].get_values()
In [30]:
X_train, X_test, y_train, y_test = train_test_split(X_red_wine, y_red_wine, test_size=0.2, random_state=42)
The y values are string categories ('Good' and 'Bad') and so need to be converted to integers so that skflow will understand the categories. This is done using fit_transform() from the CategoricalProcessor class in skflow. (Note that skflow's CategoricalProcessor appears to reserve an extra index for unseen values, which is why the vocabulary below reports three classes for our two categories, and why n_classes=3 is passed to the estimator.)
In [31]:
cat_processor = skflow.preprocessing.CategoricalProcessor()
y_train_cat = np.array(list(cat_processor.fit_transform(y_train)))
y_test_cat = np.array(list(cat_processor.transform(y_test)))
In [32]:
n_classes = len(cat_processor.vocabularies_[0])
In [33]:
print("There are {0} different classes.").format(n_classes)
In [34]:
# Define the model
def categorical_model(X, y):
    return skflow.models.logistic_regression(X, y)
In [35]:
# Train the model
classifier = skflow.TensorFlowEstimator(model_fn=categorical_model,
n_classes=3, learning_rate=0.01)
In [36]:
classifier.fit(X_train, y_train_cat)
Out[36]:
In [37]:
print("Accuracy: {0}".format(metrics.accuracy_score(classifier.predict(X_test), y_test_cat)))
Not bad for a start! Now the model needs to be revised.
Now let's try two features, 'total sulfur dioxide' and 'density', to see if this improves the model.
In [38]:
X_red_wine = red_wine_newcats[['total sulfur dioxide','density']].get_values()
In [39]:
X_train, X_test, y_train, y_test = train_test_split(X_red_wine, y_red_wine, test_size=0.2,
random_state=42)
In [40]:
cat_processor = skflow.preprocessing.CategoricalProcessor()
y_train_cat = np.array(list(cat_processor.fit_transform(y_train)))
y_test_cat = np.array(list(cat_processor.transform(y_test)))
In [41]:
n_classes = len(cat_processor.vocabularies_[0])
In [42]:
print("There are {0} different classes.").format(n_classes)
In [43]:
# Define the model
def categorical_model(X, y):
    return skflow.models.logistic_regression(X, y)
In [44]:
# Train the model
classifier = skflow.TensorFlowEstimator(model_fn=categorical_model,
n_classes=3, learning_rate=0.01)
In [45]:
classifier.fit(X_train, y_train_cat)
Out[45]:
In [46]:
print("Accuracy: {0}".format(metrics.accuracy_score(classifier.predict(X_test), y_test_cat)))
This fit got worse. Let's see what happens when we consider more features to make a model.
In [47]:
red_wine_newcats.iloc[:,1:-2].head()
Out[47]:
In [48]:
X_red_wine = red_wine_newcats.iloc[:,1:-2].get_values()
In [49]:
X_train, X_test, y_train, y_test = train_test_split(X_red_wine, y_red_wine, test_size=0.2,
random_state=42)
In [50]:
cat_processor = skflow.preprocessing.CategoricalProcessor()
y_train_cat = np.array(list(cat_processor.fit_transform(y_train)))
y_test_cat = np.array(list(cat_processor.transform(y_test)))
In [51]:
n_classes = len(cat_processor.vocabularies_[0])
In [52]:
print("There are {0} different classes.").format(n_classes)
In [53]:
# Define the model
def categorical_model(X, y):
    return skflow.models.logistic_regression(X, y)
In [54]:
# Train the model
classifier = skflow.TensorFlowEstimator(model_fn=categorical_model,
n_classes=3, learning_rate=0.005)
In [55]:
classifier.fit(X_train, y_train_cat)
Out[55]:
In [56]:
print("Accuracy: {0}".format(metrics.accuracy_score(classifier.predict(X_test), y_test_cat)))
An improved accuracy!
Now we get serious and will use TensorFlow to model the wine quality data set.
In [57]:
import tensorflow as tf
Convert y-labels from strings to integers. Bad = 1, Good = 0.
In [58]:
y_red_wine_raveled = y_red_wine.ravel()
y_red_wine_integers = [y.replace('Bad', '1') for y in y_red_wine_raveled]
y_red_wine_integers = [y.replace('Good', '0') for y in y_red_wine_integers]
y_red_wine_integers = [np.int(y) for y in y_red_wine_integers]
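An equivalent, more direct way to do the same mapping with a dictionary (just an alternative; the string-replace version above works fine):

label_map = {'Bad': 1, 'Good': 0}
y_ints_alt = [label_map[y] for y in y_red_wine_raveled]  # same result as above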
Convert y-labels to one-hot vectors.
In [59]:
def dense_to_one_hot(labels_dense, num_classes=2):
    # Convert class labels from scalars to one-hot vectors
    num_labels = len(labels_dense)
    index_offset = np.arange(num_labels) * num_classes
    labels_one_hot = np.zeros((num_labels, num_classes))
    labels_one_hot.flat[index_offset + labels_dense] = 1
    return labels_one_hot
In [60]:
y_one_hot = dense_to_one_hot(y_red_wine_integers, num_classes=2)
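As a quick sanity check (not part of the original notebook), labels [0, 1, 0] should come back as the rows [1, 0], [0, 1], and [1, 0]:

print(dense_to_one_hot([0, 1, 0]))
# Expected output (floats, since the array starts as np.zeros):
# [[ 1.  0.]
#  [ 0.  1.]
#  [ 1.  0.]]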
Divide the data into training and test sets
In [61]:
X_train, X_test, y_train, y_test = train_test_split(X_red_wine, y_one_hot, test_size=0.2, random_state=42)
Define modeling parameters
In [62]:
learning_rate = 0.005
batch_size = 126
In [63]:
X = tf.placeholder("float",[None,10])
Y = tf.placeholder("float",[None,2])
Set model weights and biases.
In [64]:
W = tf.Variable(tf.zeros([10, 2]))
b = tf.Variable(tf.zeros([2]))
Construct the model. We will use softmax regression since this is good for categorical data.
In [65]:
model = tf.nn.softmax(tf.matmul(X, W) + b)
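Softmax turns each row of raw scores XW + b into class probabilities that sum to one. For reference, here is a minimal NumPy sketch of the same row-wise computation (an illustration only; tf.nn.softmax is what we actually use):

def softmax_np(z):
    # Subtract the row max for numerical stability, then normalize.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)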
Minimize the error using cross entropy.
In [66]:
cost = -tf.reduce_mean(Y*tf.log(model))
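Note that this version averages Y*tf.log(model) over every element of the batch-by-class matrix. An equivalent, perhaps more familiar form sums over the classes first and then averages over the batch; with two classes it differs only by a constant factor of 2, which simply rescales the gradients (shown for reference, not used below):

cost_per_example = tf.reduce_mean(
    -tf.reduce_sum(Y * tf.log(model), reduction_indices=1))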
Define the optimizer. We will use gradient descent.
In [67]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
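For intuition, minimize() computes the gradients of the cost with respect to the trainable variables and applies the plain gradient-descent update W <- W - learning_rate * grad. The following sketch does the same thing by hand (an illustration only; the one-liner above is what we use):

# Manual equivalent of GradientDescentOptimizer(...).minimize(cost):
grad_W, grad_b = tf.gradients(cost, [W, b])
train_step = tf.group(W.assign_sub(learning_rate * grad_W),
                      b.assign_sub(learning_rate * grad_b))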
Define a TensorFlow session.
In [68]:
sess = tf.Session()
Initialize all variables.
In [69]:
init = tf.initialize_all_variables()
sess.run(init)
In [70]:
for i in range(100):
    average_cost = 0
    number_of_batches = int(len(X_train) / batch_size)
    # Iterate over (start, end) index pairs, one pair per mini-batch.
    for start, end in zip(range(0, len(X_train), batch_size),
                          range(batch_size, len(X_train), batch_size)):
        sess.run(optimizer, feed_dict={X: X_train[start:end], Y: y_train[start:end]})
        # Compute average loss
        average_cost += sess.run(cost, feed_dict={X: X_train[start:end], Y: y_train[start:end]}) / number_of_batches
    print("Epoch:", '%04d' % (i+1), "cost=", "{:.9f}".format(average_cost))
print('Finished optimization!')
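The zip of the two staggered ranges produces the (start, end) pairs for each mini-batch; note that any final partial batch is silently dropped. For example, with a hypothetical training-set length of 500:

print(list(zip(range(0, 500, 126), range(126, 500, 126))))
# [(0, 126), (126, 252), (252, 378)] -- rows 378:500 are never used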
Test the model:
In [71]:
correct_prediction = tf.equal(tf.argmax(model, 1), tf.argmax(Y, 1))  # compare against the Y placeholder, fed with y_test below
Calculate the accuracy. Casting the boolean predictions to floats turns correct ones into 1.0 and incorrect ones into 0.0, so their mean is the fraction of test examples classified correctly:
In [72]:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print("Accuracy:", sess.run(accuracy, feed_dict={X: X_test, Y: y_test}))
Define modeling parameters
In [73]:
learning_rate = 0.005
batch_size = 126
In [74]:
X = tf.placeholder("float",[None,10], name='X-input')
Y = tf.placeholder("float",[None,2], name='y-input')
Set model weights and biases.
In [75]:
W = tf.Variable(tf.zeros([10, 2]),name='Weights')
b = tf.Variable(tf.zeros([2]),name='Biases')
Use a name scope to organize nodes in the graph visualizer. A name scope groups related operations under a single node, so the graph is easier to read in TensorBoard.
In [76]:
with tf.name_scope("Wx_b") as scope:
model = tf.nn.softmax(tf.matmul(X,W) + b)
Add summary ops to collect data
In [77]:
w_hist = tf.histogram_summary("Weights", W)
b_hist = tf.histogram_summary("Biases", b)
y_hist = tf.histogram_summary("model", model)
Define the loss and optimizer functions.
In [78]:
with tf.name_scope("cross_entropy") as scope:
cross_entropy = -tf.reduce_mean(Y*tf.log(model))
ce_summ = tf.scalar_summary("cross entropy", cross_entropy)
with tf.name_scope("train") as scope:
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
In [79]:
with tf.name_scope("test") as scope:
correct_prediction = tf.equal(tf.argmax(model, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
accuracy_summary = tf.scalar_summary("accuracy", accuracy)
Define a TensorFlow session and set up a directory to store the results for the TensorBoard graph visualization utility.
In [80]:
sess = tf.Session()
merged = tf.merge_all_summaries()
writer = tf.train.SummaryWriter("tmp/wine_quality_logs", sess.graph_def)
Initialize all variables.
In [81]:
init = tf.initialize_all_variables()
sess.run(init)
In [82]:
for i in range(100):
    number_of_batches = int(len(X_train) / batch_size)
    if i % 10 == 0:
        # Every tenth pass, record summary data and the test-set accuracy
        # instead of training.
        feed = {X: X_test, Y: y_test}
        result = sess.run([merged, accuracy], feed_dict=feed)
        summary_str = result[0]
        acc = result[1]
        writer.add_summary(summary_str, i)
        print("Accuracy at step %s: %s" % (i, acc))
    else:
        for start, end in zip(range(0, len(X_train), batch_size),
                              range(batch_size, len(X_train), batch_size)):
            feed = {X: X_train[start:end], Y: y_train[start:end]}
            sess.run(optimizer, feed_dict=feed)
In [83]:
print("Accuracy:", sess.run(accuracy, feed_dict={X: X_test, Y: y_test}))
Navigate to the directory with this Jupyter notebook. Then launch TensorBoard with the following command:
In [84]:
#!python ~/anaconda/bin/tensorboard --logdir=tmp/wine_quality_logs
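TensorBoard serves on port 6006 by default, so once it is running you can point your browser at localhost:6006 to explore the graph and the summaries recorded during training.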
Reference: Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.