Training a classification model for wine production quality

Objective:

In this lab, you will use the Keras Sequential API to create a classification model. You will learn how to use the tf.data API for creating input pipelines and use feature columns to prepare the data to be consumed by a neural network.

Lab Scope:

This lab does not cover how to make predictions on the model or deploy it to Cloud AI Platform.

Learning objectives:

  1. Apply techniques to clean and inspect data.
  2. Split the dataset into training, validation and test sets.
  3. Use the tf.data.Dataset to create an input pipeline.
  4. Use feature columns to prepare the data to be consumed by a neural network.
  5. Define, compile and train a model using the Keras Sequential API.

In a classification problem, we aim to select the output from a limited set of discrete values, like a category or a class. Contrast this with a regression problem, where we aim to predict a value from a continuous range of values.

This notebook uses the Wine Production Quality Dataset and builds a model to predict the production quality of wine given a set of attributes such as its citric acid content, density, and others.

To do this, we'll provide the model with examples of different wines that received a rating from an evaluator. The ratings range from 0 to 10 (0 being very poor quality and 10 being great quality). We will then use the trained model to predict the rating a new wine will receive.

Since we are learning how to use the TensorFlow 2.x API, this example uses the tf.keras API. Please see this guide for details.


In [ ]:
# Ensure the right version of TensorFlow is installed.
!pip freeze | grep tensorflow==2.1 || pip install tensorflow==2.1

In [ ]:
from __future__ import absolute_import, division, print_function, unicode_literals

import pathlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

The Wine Quality Dataset

The dataset is available in the UCI Machine Learning Repository.

Get the data

There is a copy of the White Wine dataset available on Google Cloud Storage (GCS). The cell below shows the location of the CSV file.


In [ ]:
dataset_path = "gs://cloud-training-demos/wine_quality/winequality-white.csv"

To visualize and manipulate the data, we will use pandas.

The first step is to import the data. We list the columns that will be used to train our model; these column names define which data will compose the dataframe object in pandas.


In [ ]:
column_names = ['fixed_acidity','volatile_acidity','citric_acid','residual_sugar',
                'chlorides','free_sulfur_dioxide','total_sulfur_dioxide','density',
                'pH','sulphates','alcohol','quality']

In [ ]:
raw_dataframe = pd.read_csv(dataset_path, names=column_names, header=0,
                            na_values=" ", comment='\t',
                            sep=";", skipinitialspace=True)

# Cast the features to float and the quality label to int.
raw_dataframe = raw_dataframe.astype(float)
raw_dataframe['quality'] = raw_dataframe['quality'].astype(int)
dataframe = raw_dataframe.copy()

Clean the data

Datasets can sometimes contain null values. Running the next cell counts how many null values exist in each of the columns.

Note: There are many other steps involved in making sure the data is clean, but they are out of the scope of this exercise.


In [ ]:
dataframe.isna().sum()

We can see that, in this dataset, there are no null values. If there were any, we could run dataframe = dataframe.dropna() to drop them and keep this tutorial simple.
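
If you do need to drop rows with missing values, a minimal sketch (shown here on a copy of the dataframe, since our dataset has no nulls) would look like this:


In [ ]:
# Hypothetical cleanup step: drop rows containing any null values.
# Not needed for this dataset, since isna().sum() showed no nulls.
cleaned = dataframe.copy().dropna()
print(len(dataframe) - len(cleaned), 'rows would be dropped')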

Inspect the data

Let's take a look at the dataframe content. The tail() method, when run on a dataframe, shows the last n rows (n is 5 by default).


In [ ]:
dataframe.tail()

In [ ]:
data_stats = dataframe.describe()
data_stats = data_stats.transpose()
data_stats

Have a quick look at the joint distribution of a few pairs of columns from the dataset:


In [ ]:
import seaborn as sns

sns.pairplot(dataframe[["quality", "citric_acid", "residual_sugar", "alcohol"]], diag_kind="kde")

--- Some considerations ---

Did you notice anything when looking at the stats table?

One useful piece of information we can get from these stats is the min and max value of each column. This helps us understand the ranges in which the features fall.

Based on the description of the dataset and the task we are trying to achieve, do you see any issues with the examples we have available to train on?

Did you notice that the ratings in the dataset range from 3 to 9? There is no wine rated 10, and none rated 0 - 2. This will likely produce a poor model that is not able to generalize well to examples of fantastic tasting wine (nor to the ones that taste pretty bad!). One way to fix this is to make sure your dataset represents all possible classes well. A related analysis, which we do not perform in this exercise, is to check whether the data is balanced (a quick check is sketched below); a balanced dataset produces a fairer model, and that is always a good thing!
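
As a quick sanity check (a sketch, not part of the lab's required steps), the cell below counts how many examples fall into each quality rating; counts that differ by orders of magnitude are a sign of imbalance.


In [ ]:
# Count how many examples exist for each quality rating.
# A heavily skewed distribution suggests the dataset is imbalanced.
dataframe['quality'].value_counts().sort_index()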

Split the data into train, validation and test

Now split the dataset into a training, validation, and test set.

Test sets are used for a final evaluation of the trained model.

There are more sophisticated ways to make sure that your splitting method is repeatable. Ideally, the sets would always be the same after splitting, to avoid random variation between runs, which makes experimentation difficult; one simple approach is sketched after the cell below.


In [ ]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
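
For instance, a simple way to make the split above repeatable (a sketch; the variables below are not used in the rest of this lab) is to pass a fixed random_state to train_test_split:


In [ ]:
# Sketch: a fixed random_state makes the train/validation/test partitions
# identical across runs, which keeps experiments comparable.
train_r, test_r = train_test_split(dataframe, test_size=0.2, random_state=42)
train_r, val_r = train_test_split(train_r, test_size=0.2, random_state=42)
print(len(train_r), 'train /', len(val_r), 'validation /', len(test_r), 'test examples')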

Use the tf.data.Dataset

The tf.data.Dataset allows for writing descriptive and efficient input pipelines. Dataset usage follows a common pattern:

  • Create a source dataset from your input data.
  • Apply dataset transformations to preprocess the data.
  • Iterate over the dataset and process the elements.

Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.
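
As a minimal illustration of this pattern (on a toy list of numbers, before we apply it to our dataframe):


In [ ]:
# 1. Create a source dataset from in-memory data.
toy_ds = tf.data.Dataset.from_tensor_slices([1., 2., 3., 4.])
# 2. Apply transformations (square each element, then batch).
toy_ds = toy_ds.map(lambda x: x * x).batch(2)
# 3. Iterate over the dataset and process the elements.
for batch in toy_ds:
  print(batch.numpy())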

The df_to_dataset method below creates a dataset object from a pandas dataframe.


In [ ]:
def df_to_dataset(dataframe, epochs=10, shuffle=True, batch_size=64):
  dataframe = dataframe.copy()
  labels = tf.keras.utils.to_categorical(dataframe.pop('quality'), num_classes=11)  # pop the label column and one-hot encode it into 11 classes (ratings 0-10)
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.repeat(epochs).batch(batch_size)
  return ds

The next step is to create batches from the train, validation and test datasets that we split earlier. The df_to_dataset function defined above uses a default batch size of 64.


In [ ]:
train_ds = df_to_dataset(train)
val_ds = df_to_dataset(val, shuffle=False)
test_ds = df_to_dataset(test, shuffle=False)

Let's look at one batch of the data. The example below prints the content of a batch (the column names, the elements from the citric_acid column, and the one-hot encoded elements of the quality label).


In [ ]:
for feature_batch, label_batch in train_ds.take(1):
  print('Every feature:', list(feature_batch.keys()))
  print('A batch of citric acid:', feature_batch['citric_acid'])
  print('A batch of quality:', label_batch )

Create feature columns

TensorFlow provides many types of feature columns. In this exercise, all the feature columns are of type numeric. If there were any text or categorical values, transformations would need to take place to make the input all numeric.

However, you often don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. To do this, use the bucketized_column method of feature columns. A bucketized column represents its numeric input as a one-hot vector that indicates which bucket, defined by the boundaries, the value falls into.

Feature columns are the object type used to create feature layers, which we will feed to the Keras model.


In [ ]:
feature_columns = []

fixed_acidity = tf.feature_column.numeric_column('fixed_acidity')
bucketized_fixed_acidity = tf.feature_column.bucketized_column(
    fixed_acidity, boundaries=[3., 5., 7., 9., 11., 13., 14.])
feature_columns.append(bucketized_fixed_acidity)

volatile_acidity = tf.feature_column.numeric_column('volatile_acidity')
bucketized_volatile_acidity = tf.feature_column.bucketized_column(
    volatile_acidity, boundaries=[0., 0.2, 0.4, 0.6, 0.8, 1.])
feature_columns.append(bucketized_volatile_acidity)

citric_acid = tf.feature_column.numeric_column('citric_acid')
bucketized_citric_acid = tf.feature_column.bucketized_column(
    citric_acid, boundaries=[0., 0.4, 0.7, 1.0, 1.3, 1.8])
feature_columns.append(bucketized_citric_acid)

residual_sugar = tf.feature_column.numeric_column('residual_sugar')
bucketized_residual_sugar = tf.feature_column.bucketized_column(
    residual_sugar, boundaries=[0.6, 10., 20., 30., 40., 50., 60., 70.])
feature_columns.append(bucketized_residual_sugar)

chlorides = tf.feature_column.numeric_column('chlorides')
bucketized_chlorides = tf.feature_column.bucketized_column(
    chlorides, boundaries=[0., 0.1, 0.2, 0.3, 0.4])
feature_columns.append(bucketized_chlorides)

free_sulfur_dioxide = tf.feature_column.numeric_column('free_sulfur_dioxide')
bucketized_free_sulfur_dioxide = tf.feature_column.bucketized_column(
    free_sulfur_dioxide, boundaries=[1., 50., 100., 150., 200., 250., 300.])
feature_columns.append(bucketized_free_sulfur_dioxide)

total_sulfur_dioxide = tf.feature_column.numeric_column('total_sulfur_dioxide')
bucketized_total_sulfur_dioxide = tf.feature_column.bucketized_column(
    total_sulfur_dioxide, boundaries=[9., 100., 200., 300., 400., 500.])
feature_columns.append(bucketized_total_sulfur_dioxide)

density = tf.feature_column.numeric_column('density')
bucketized_density = tf.feature_column.bucketized_column(
    density, boundaries=[0.9, 1.0, 1.1])
feature_columns.append(bucketized_density)

pH = tf.feature_column.numeric_column('pH')
bucketized_pH = tf.feature_column.bucketized_column(
    pH, boundaries=[2., 3., 4.])
feature_columns.append(bucketized_pH)

sulphates = tf.feature_column.numeric_column('sulphates')
bucketized_sulphates = tf.feature_column.bucketized_column(
    sulphates, boundaries=[0.2, 0.4, 0.7, 1.0, 1.1])
feature_columns.append(bucketized_sulphates)

alcohol = tf.feature_column.numeric_column('alcohol')
bucketized_alcohol = tf.feature_column.bucketized_column(
    alcohol, boundaries=[8., 9., 10., 11., 12., 13., 14.])
feature_columns.append(bucketized_alcohol)


feature_columns

In [ ]:
# Create a feature layer from the feature columns

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
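
To get a feel for what the feature layer produces, you can optionally apply it to one batch from the training pipeline; each bucketized column contributes a one-hot block to the resulting dense tensor. This is an inspection step only and is not required by the rest of the lab.


In [ ]:
# Optional: inspect the transformed features for one batch (first two rows shown).
for feature_batch, _ in train_ds.take(1):
  print(feature_layer(feature_batch).numpy()[:2])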

Define, compile and train the Keras model

We will be using the Keras Sequential API to create a neural network model for the classification of wine quality.

The model will be composed of the input layer (the feature_layer created above), two dense hidden layers with eight nodes each, and an output layer with 11 nodes and a softmax activation, which lets the model predict a probability for each possible rating (0 - 10) of an instance being inferred.

When compiling the model, we define a loss function, an optimizer, and which metrics to use to evaluate the model. CategoricalCrossentropy is a loss commonly used in classification tasks. Losses are a mathematical way of measuring how wrong the model's predictions are.
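
To make this concrete, here is a tiny self-contained example with made-up numbers: categorical cross-entropy assigns a small loss to a confident correct prediction and a large loss to a confident wrong one.


In [ ]:
# Toy example (made-up numbers): the true class is class 2 out of 4.
cce = tf.keras.losses.CategoricalCrossentropy()
y_true = [[0., 0., 1., 0.]]
print(cce(y_true, [[0.05, 0.05, 0.85, 0.05]]).numpy())  # confident and correct -> small loss
print(cce(y_true, [[0.85, 0.05, 0.05, 0.05]]).numpy())  # confident but wrong -> large loss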

Optimizers tie together the loss function and the model parameters by updating the model in response to the output of the loss function. In simpler terms, optimizers shape and mold your model into its most accurate possible form by adjusting the weights. The loss function is the guide to the terrain, telling the optimizer when it is moving in the right or wrong direction. We will use Adam as our optimizer for this exercise. Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on the training data.

There are many types of optimizers one can choose from. Ideally, when creating an ML model, try to identify an optimizer that has been empirically successful on similar tasks.
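
If you want explicit control over the optimizer's hyperparameters, an equivalent way to specify Adam (the learning rate shown here is simply its default value, for illustration) would be:


In [ ]:
# Equivalent to passing optimizer='adam' in compile(), but with an explicit learning rate.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)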


In [ ]:
model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(8, activation='relu'),
  layers.Dense(8, activation='relu'),
  layers.Dense(11, activation='softmax')
])

In [ ]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

model.fit(train_ds,
          validation_data=val_ds,
          epochs=5)

When training a model, you want to evaluate its performance by looking at the loss and the chosen metric(s). The validation loss and accuracy indicate whether the model is actually learning and able to generalize, or whether it is overfitting.
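
Although deployment and serving predictions are out of scope for this lab, a final check on the held-out test set can be done with model.evaluate (a sketch, using the loss and metric defined at compile time):


In [ ]:
# Sketch: final evaluation on the held-out test set.
loss, accuracy = model.evaluate(test_ds)
print('Test loss:', loss, '- Test accuracy:', accuracy)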

Conclusion

This notebook introduced a few concepts for handling a classification problem with the Keras Sequential API.

  • We looked at some techniques to clean and inspect data.
  • We split the dataset into training, validation and test datasets.
  • We used the tf.data.Dataset to create an input pipeline.
  • We went over some basics on loss and optimizers.
  • We covered the steps to define, compile and train a model using the Keras Sequential API.