In this tutorial we will be analyzing a data set on wine quality taken from the UC Irvine Machine Learning Repository. The data consists of a chemical analysis of many types of wine, and each wine is given a quality score. You can read more about the data here. Some math concepts in this tutorial will not be covered in detail; please see the Math Primer in the GitHub repo for an excellent discussion of common math concepts in machine learning.
In [1]:
import numpy as np
import pandas as pd
Load the data using Pandas:
In [2]:
red_wine = pd.read_csv('winequality-red.csv',sep=';')
Let's take a look at the data. A good way to do this is with the info(), head(), and describe() functions in Pandas.
In [3]:
red_wine.head() # This command displays the column headings and first five rows of data.
Out[3]:
In [4]:
red_wine.describe().T # This command displays statistics about each column with numerical data.
Out[4]:
In [5]:
red_wine.info() # This command displays the column types and size for the data frame.
We will focus on the red wines for this study. Let's see how the red wines range in quality score...
To better understand the data, we will use a scatter plot matrix. This is a grid of plots comparing every feature in the data set against every other feature. All features are labeled on both the x-axis and the y-axis. If, for example, you want to compare citric acid and sulphates, you would find citric acid on the x-axis and then go up to sulphates on the y-axis to locate the plot that compares these two features. Where a feature is compared with itself you will instead see a histogram showing its distribution. It provides a good way to see how the different features are related to each other.
We will use the Python visualization library Seaborn to make these plots. Seaborn is a great tool for high-level statistical graphics.
In [6]:
import seaborn as sb
sb.set_context("notebook", font_scale=2.5)
from matplotlib import pyplot as plt
%matplotlib inline
In [7]:
sb.pairplot(red_wine, size=3)
Out[7]:
In [8]:
from IPython.display import Image
Image(filename='images/fig1.png')
Out[8]:
In [9]:
red_wine[['total sulfur dioxide', 'sulphates']].describe().T
Out[9]:
To handle the outliers I defined the function below. It takes as input a dataframe, a threshold defined as the number of standard deviations from the mean, and the columns you want to 'clean'. Rather than dropping outliers, it replaces them with the column mean (computed after the outliers are masked out).
In [10]:
def outliers(df, threshold, columns):
    # For each column, flag values more than `threshold` standard deviations
    # above the mean, then replace them with the mean of the remaining values.
    for col in columns:
        mask = df[col] > float(threshold) * df[col].std() + df[col].mean()
        df.loc[mask, col] = np.nan             # mask the outliers
        mean_property = df.loc[:, col].mean()  # mean computed without outliers
        df.loc[mask, col] = mean_property      # fill the outliers with the mean
    return df
In [11]:
column_list = red_wine.columns.tolist() # Save the column names for the wine dataframe to a list.
Below, I set the threshold to five. Any value more than five standard deviations above the mean will be labeled as an outlier.
In [12]:
threshold = 5
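To make the rule concrete, here is a quick way to inspect the cutoff for a single column and count how many values it flags (a small illustration, not part of the original analysis):

col = 'total sulfur dioxide'
cutoff = threshold * red_wine[col].std() + red_wine[col].mean()
print("Cutoff: {0:.1f}, values flagged: {1}".format(cutoff, (red_wine[col] > cutoff).sum()))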
In [13]:
red_wine_cleaned = red_wine.copy()
red_wine_cleaned = outliers(red_wine_cleaned, threshold, column_list[0:-1])
red_wine_cleaned.describe().T
Out[13]:
Now let's examine the data again with the outliers replaced.
In [14]:
sb.pairplot(red_wine_cleaned, size=3)
Out[14]:
Let's compare the two features total sulfur dioxide and sulphates again, with and without outliers, to see the difference.
In [15]:
sb.set_context("notebook", font_scale=1)
pp = sb.pairplot(red_wine[['total sulfur dioxide', 'sulphates']], size=3)
plt.subplots_adjust(top=0.9)
pp.fig.suptitle('With Outliers', fontsize=20, verticalalignment='top')
Out[15]:
In [16]:
sb.set_context("notebook", font_scale=1)
pp = sb.pairplot(red_wine_cleaned[['total sulfur dioxide', 'sulphates']], size=3)
plt.subplots_adjust(top=0.9)
pp.fig.suptitle('Without Outliers', fontsize=20, verticalalignment='top')
Out[16]:
Now we will bin the data to define categories. Our model will try to infer the category given the various chemical properties measured for the wine data set.
In [17]:
print("The range is wine quality is {0}".format(np.sort(red_wine_cleaned['quality'].unique())))
First, we will bin the data into three bins based on quality: 'Bad', 'Average', and 'Good'.
In [18]:
bins = [3, 5, 6, 8]
red_wine_cleaned['category'] = pd.cut(red_wine_cleaned.quality, bins, labels=['Bad', 'Average', 'Good'])
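Note that pd.cut uses intervals that are open on the left and closed on the right, so with bins = [3, 5, 6, 8] the scores map as (3, 5] -> 'Bad', (5, 6] -> 'Average', (6, 8] -> 'Good'; a quality of exactly 3 falls outside the first interval and gets a missing category. A quick check (an illustration, not part of the original notebook):

print(pd.cut([3, 4, 5, 6, 7, 8], bins, labels=['Bad', 'Average', 'Good']))
# 3 -> NaN, 4 -> Bad, 5 -> Bad, 6 -> Average, 7 -> Good, 8 -> Good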
In [19]:
sb.pairplot(red_wine_cleaned.drop(['quality'],1),hue='category', size=3)
Out[19]:
We can examine some metrics by category using the Pandas routines groupby() and agg(). I won't discuss these routines in the tutorial, but if you want to learn more, please read the Pandas documentation.
In [20]:
red_wine_cleaned.drop('quality',1).groupby('category').agg(['mean','std']).T
Out[20]:
Notice that there is quite a bit of overlap between the average values and the bad values. To improve the model fitting, we will throw out the average values and only perform a classification between the 'Good' wine and the 'Bad' wine.
In [21]:
red_wine_newcats = red_wine_cleaned[red_wine_cleaned['category'].isin(['Bad','Good'])].copy()
In [22]:
np.sort(red_wine_newcats['quality'].unique())
Out[22]:
In [23]:
bins = [3, 5, 8]
red_wine_newcats['category'] = pd.cut(red_wine_newcats.quality, bins, labels=['Bad', 'Good'])
In [24]:
red_wine.shape, red_wine_newcats.shape
Out[24]:
In [25]:
sb.pairplot(red_wine_newcats.drop(['quality'],1),hue='category', size=3)
Out[25]:
In [26]:
red_wine_newcats.drop('quality',1).groupby('category').agg(['mean','std']).T
Out[26]:
Examine the 'cleaned' pairplot above. Can you identify any features that appear related to each other? Hint: think back to high school chemistry class. Features that have a linear dependency are called collinear and can be problematic if they are included in modeling. See the Math Primer for a discussion of collinearity.
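One quick way to check for collinearity numerically is a correlation matrix; strongly correlated pairs (pH and the acidity features, for example, tend to move together) are candidates for removal. A minimal sketch, not part of the original notebook:

# Pairwise Pearson correlations between the chemical features.
corr = red_wine_newcats.drop(['quality', 'category'], axis=1).corr()
print(corr.round(2))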
Before using TensorFlow directly, we will use skflow to make the model. Skflow is a Python library that wraps many of the TensorFlow commands in routines that are more like scikit-learn. Therefore, if you are more familiar with scikit-learn, Skflow can be a gentle introduction to TensorFlow.
In [27]:
import sklearn
from sklearn import metrics, preprocessing
from sklearn.cross_validation import train_test_split
import skflow
It looks like total sulfur dioxide is a good indicator of wine quality. Let's try using this feature to classify whether a wine is 'Good' or 'Bad'.
In [28]:
y_red_wine = red_wine_newcats[['category']].get_values()
In [29]:
X_red_wine = red_wine_newcats['total sulfur dioxide'].get_values()
In [30]:
X_train, X_test, y_train, y_test = train_test_split(X_red_wine, y_red_wine, test_size=0.2, random_state=42)
The y values are string categories ('Good' and 'Bad') and so need to be converted to integers so that skflow will understand the categories. This is done using fit_transform() from the CategoricalProcessor class in skflow. (Note that skflow's CategoricalProcessor appears to reserve an extra index for unseen values, which is why the vocabulary below reports three classes for our two categories, and why n_classes=3 is passed to the estimator.)
In [31]:
cat_processor = skflow.preprocessing.CategoricalProcessor()
y_train_cat = np.array(list(cat_processor.fit_transform(y_train)))
y_test_cat = np.array(list(cat_processor.transform(y_test)))
In [32]:
n_classes = len(cat_processor.vocabularies_[0])
In [33]:
print("There are {0} different classes.").format(n_classes)
In [34]:
# Define the model
def categorical_model(X, y):
    return skflow.models.logistic_regression(X, y)
In [35]:
# Train the model
classifier = skflow.TensorFlowEstimator(model_fn=categorical_model,
n_classes=3, learning_rate=0.01)
In [36]:
classifier.fit(X_train, y_train_cat)
Out[36]:
In [37]:
print("Accuracy: {0}".format(metrics.accuracy_score(classifier.predict(X_test), y_test_cat)))
Not bad for a start! Now the model needs to be revised.
Now let's try two features, 'total sulfur dioxide' and 'density', to see if this improves the model.
In [38]:
X_red_wine = red_wine_newcats[['total sulfur dioxide','density']].get_values()
In [39]:
X_train, X_test, y_train, y_test = train_test_split(X_red_wine, y_red_wine, test_size=0.2,
random_state=42)
In [40]:
cat_processor = skflow.preprocessing.CategoricalProcessor()
y_train_cat = np.array(list(cat_processor.fit_transform(y_train)))
y_test_cat = np.array(list(cat_processor.transform(y_test)))
In [41]:
n_classes = len(cat_processor.vocabularies_[0])
In [42]:
print("There are {0} different classes.").format(n_classes)
In [43]:
# Define the model
def categorical_model(X, y):
    return skflow.models.logistic_regression(X, y)
In [44]:
# Train the model
classifier = skflow.TensorFlowEstimator(model_fn=categorical_model,
n_classes=3, learning_rate=0.01)
In [45]:
classifier.fit(X_train, y_train_cat)
Out[45]:
In [46]:
print("Accuracy: {0}".format(metrics.accuracy_score(classifier.predict(X_test), y_test_cat)))
This fit got worse. Let's see what happens when we consider more features to make a model.
In [47]:
red_wine_newcats.iloc[:,1:-2].head()
Out[47]:
In [48]:
X_red_wine = red_wine_newcats.iloc[:,1:-2].get_values()
In [49]:
X_train, X_test, y_train, y_test = train_test_split(X_red_wine, y_red_wine, test_size=0.2,
random_state=42)
In [50]:
cat_processor = skflow.preprocessing.CategoricalProcessor()
y_train_cat = np.array(list(cat_processor.fit_transform(y_train)))
y_test_cat = np.array(list(cat_processor.transform(y_test)))
In [51]:
n_classes = len(cat_processor.vocabularies_[0])
In [52]:
print("There are {0} different classes.").format(n_classes)
In [53]:
# Define the model
def categorical_model(X, y):
    return skflow.models.logistic_regression(X, y)
In [54]:
# Train the model
classifier = skflow.TensorFlowEstimator(model_fn=categorical_model,
n_classes=3, learning_rate=0.005)
In [55]:
classifier.fit(X_train, y_train_cat)
Out[55]:
In [56]:
print("Accuracy: {0}".format(metrics.accuracy_score(classifier.predict(X_test), y_test_cat)))
An improved accuracy!
Now we get serious and will use TensorFlow to model the wine quality data set.
In [57]:
import tensorflow as tf
Convert y-labels from strings to integers. Bad = 1, Good = 0.
In [58]:
y_red_wine_raveled = y_red_wine.ravel()
y_red_wine_integers = [y.replace('Bad', '1') for y in y_red_wine_raveled]
y_red_wine_integers = [y.replace('Good', '0') for y in y_red_wine_integers]
y_red_wine_integers = [np.int(y) for y in y_red_wine_integers]
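An equivalent, more direct way to do the same mapping with a dictionary (just an alternative; the string-replace version above works fine):

label_map = {'Bad': 1, 'Good': 0}
y_ints_alt = [label_map[y] for y in y_red_wine_raveled]  # same result as above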
Convert y-labels to one-hot vectors.
In [59]:
def dense_to_one_hot(labels_dense, num_classes=2):
    # Convert class labels from scalars to one-hot vectors
    num_labels = len(labels_dense)
    index_offset = np.arange(num_labels) * num_classes
    labels_one_hot = np.zeros((num_labels, num_classes))
    labels_one_hot.flat[index_offset + labels_dense] = 1
    return labels_one_hot
In [60]:
y_one_hot = dense_to_one_hot(y_red_wine_integers, num_classes=2)
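As a quick sanity check (not part of the original notebook), labels [0, 1, 0] should come back as the rows [1, 0], [0, 1], and [1, 0]:

print(dense_to_one_hot([0, 1, 0]))
# Expected output (floats, since the array starts as np.zeros):
# [[ 1.  0.]
#  [ 0.  1.]
#  [ 1.  0.]]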
Divide the data into training and test sets
In [61]:
X_train, X_test, y_train, y_test = train_test_split(X_red_wine, y_one_hot, test_size=0.2, random_state=42)
Define modeling parameters
In [62]:
learning_rate = 0.005
batch_size = 126
In [63]:
X = tf.placeholder("float",[None,10])
Y = tf.placeholder("float",[None,2])
Set model weights and biases.
In [64]:
W = tf.Variable(tf.zeros([10, 2]))
b = tf.Variable(tf.zeros([2]))
Construct the model. We will use softmax regression since this is good for categorical data.
In [65]:
model = tf.nn.softmax(tf.matmul(X, W) + b)
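Softmax turns each row of raw scores XW + b into class probabilities that sum to one. For reference, here is a minimal NumPy sketch of the same row-wise computation (an illustration only; tf.nn.softmax is what we actually use):

def softmax_np(z):
    # Subtract the row max for numerical stability, then normalize.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)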
Minimize the error using cross entropy.
In [66]:
cost = -tf.reduce_mean(Y*tf.log(model))
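Note that this version averages Y*tf.log(model) over every element of the batch-by-class matrix. An equivalent, perhaps more familiar form sums over the classes first and then averages over the batch; with two classes it differs only by a constant factor of 2, which simply rescales the gradients (shown for reference, not used below):

cost_per_example = tf.reduce_mean(
    -tf.reduce_sum(Y * tf.log(model), reduction_indices=1))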
Define the optimizer. We will use gradient descent.
In [67]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
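For intuition, minimize() computes the gradients of the cost with respect to the trainable variables and applies the plain gradient-descent update W <- W - learning_rate * grad. The following sketch does the same thing by hand (an illustration only; the one-liner above is what we use):

# Manual equivalent of GradientDescentOptimizer(...).minimize(cost):
grad_W, grad_b = tf.gradients(cost, [W, b])
train_step = tf.group(W.assign_sub(learning_rate * grad_W),
                      b.assign_sub(learning_rate * grad_b))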
Define a TensorFlow session.
In [68]:
sess = tf.Session()
Initialize all variables.
In [69]:
init = tf.initialize_all_variables()
sess.run(init)
In [70]:
for i in range(100):
    average_cost = 0
    number_of_batches = int(len(X_train) / batch_size)
    # Iterate over (start, end) index pairs, one pair per mini-batch.
    for start, end in zip(range(0, len(X_train), batch_size),
                          range(batch_size, len(X_train), batch_size)):
        sess.run(optimizer, feed_dict={X: X_train[start:end], Y: y_train[start:end]})
        # Compute average loss
        average_cost += sess.run(cost, feed_dict={X: X_train[start:end], Y: y_train[start:end]}) / number_of_batches
    print("Epoch:", '%04d' % (i+1), "cost=", "{:.9f}".format(average_cost))
print('Finished optimization!')
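The zip of the two staggered ranges produces the (start, end) pairs for each mini-batch; note that any final partial batch is silently dropped. For example, with a hypothetical training-set length of 500:

print(list(zip(range(0, 500, 126), range(126, 500, 126))))
# [(0, 126), (126, 252), (252, 378)] -- rows 378:500 are never used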
Test the model:
In [71]:
correct_prediction = tf.equal(tf.argmax(model, 1), tf.argmax(Y, 1))  # compare against the Y placeholder, fed with y_test below
Calculate the accuracy. Casting the boolean predictions to floats turns correct ones into 1.0 and incorrect ones into 0.0, so their mean is the fraction of test examples classified correctly:
In [72]:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print("Accuracy:", sess.run(accuracy, feed_dict={X: X_test, Y: y_test}))
Define modeling parameters
In [73]:
learning_rate = 0.005
batch_size = 126
In [74]:
X = tf.placeholder("float",[None,10], name='X-input')
Y = tf.placeholder("float",[None,2], name='y-input')
Set model weights and biases.
In [75]:
W = tf.Variable(tf.zeros([10, 2]),name='Weights')
b = tf.Variable(tf.zeros([2]),name='Biases')
Use a name scope to organize nodes in the graph visualizer. A name scope groups related operations under a single node, so the graph is easier to read in TensorBoard.
In [76]:
with tf.name_scope("Wx_b") as scope:
model = tf.nn.softmax(tf.matmul(X,W) + b)
Add summary ops to collect data
In [77]:
w_hist = tf.histogram_summary("Weights", W)
b_hist = tf.histogram_summary("Biases", b)
y_hist = tf.histogram_summary("model", model)
Define the loss and optimizer functions.
In [78]:
with tf.name_scope("cross_entropy") as scope:
cross_entropy = -tf.reduce_mean(Y*tf.log(model))
ce_summ = tf.scalar_summary("cross entropy", cross_entropy)
with tf.name_scope("train") as scope:
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
In [79]:
with tf.name_scope("test") as scope:
correct_prediction = tf.equal(tf.argmax(model, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
accuracy_summary = tf.scalar_summary("accuracy", accuracy)
Define a TensorFlow session and set up a directory to store the results for the TensorBoard graph visualization utility.
In [80]:
sess = tf.Session()
merged = tf.merge_all_summaries()
writer = tf.train.SummaryWriter("tmp/wine_quality_logs", sess.graph_def)
Initialize all variables.
In [81]:
init = tf.initialize_all_variables()
sess.run(init)
In [82]:
for i in range(100):
    number_of_batches = int(len(X_train) / batch_size)
    if i % 10 == 0:
        # Every tenth pass, record summary data and the test-set accuracy
        # instead of training.
        feed = {X: X_test, Y: y_test}
        result = sess.run([merged, accuracy], feed_dict=feed)
        summary_str = result[0]
        acc = result[1]
        writer.add_summary(summary_str, i)
        print("Accuracy at step %s: %s" % (i, acc))
    else:
        for start, end in zip(range(0, len(X_train), batch_size),
                              range(batch_size, len(X_train), batch_size)):
            feed = {X: X_train[start:end], Y: y_train[start:end]}
            sess.run(optimizer, feed_dict=feed)
In [83]:
print("Accuracy:", sess.run(accuracy, feed_dict={X: X_test, Y: y_test}))
Navigate to the directory with this Jupyter notebook. Then launch TensorBoard with the following command:
In [84]:
#!python ~/anaconda/bin/tensorboard --logdir=tmp/wine_quality_logs
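TensorBoard serves on port 6006 by default, so once it is running you can point your browser at localhost:6006 to explore the graph and the summaries recorded during training.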
Reference: Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.