In this lab, you will use the Keras Sequential API to create a classification model. You will learn how to use the tf.data API for creating input pipelines and use feature columns to prepare the data to be consumed by a neural network.
This lab does not cover how to make predictions on the model or deploy it to Cloud AI Platform.
In a classification problem, we aim to predict a label from a limited set of discrete values, such as a category or a class. Contrast this with a regression problem, where we aim to predict a value from a continuous range of values.
This notebook uses the Wine Production Quality Dataset and builds a model to predict the production quality of wine given a set of attributes such as its citric acidity, density, and others.
To do this, we'll provide the model with examples of different wines that received a rating from an evaluator. The ratings are given on a scale of 0 - 10 (0 being very poor quality and 10 being great quality). We will then use this model to predict the rating a new wine will receive by running inference with the trained model.
Since we are learning how to use the TensorFlow 2.x API, this example uses the tf.keras API. Please see this guide for details.
In [ ]:
# Ensure the right version of TensorFlow is installed.
!pip freeze | grep tensorflow==2.1 || pip install tensorflow==2.1
In [ ]:
from __future__ import absolute_import, division, print_function, unicode_literals
import pathlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
print(tf.__version__)
The dataset is available in the UCI Machine Learning Repository.
In [ ]:
dataset_path = "gs://cloud-training-demos/wine_quality/winequality-white.csv"
In [ ]:
column_names = ['fixed_acidity','volatile_acidity','citric_acid','residual_sugar',
                'chlorides','free_sulfur_dioxide','total_sulfur_dioxide','density',
                'pH','sulphates','alcohol','quality']
In [ ]:
raw_dataframe = pd.read_csv(dataset_path, names=column_names, header=0,
                            na_values=" ", comment='\t',
                            sep=";", skipinitialspace=True)
raw_dataframe = raw_dataframe.astype(float)
raw_dataframe['quality'] = raw_dataframe['quality'].astype(int)
dataframe = raw_dataframe.copy()
In [ ]:
dataframe.isna().sum()
We can see that this dataset contains no null values. If there were any, we could run dataframe = dataframe.dropna() to drop them and keep this tutorial simple.
In [ ]:
dataframe.tail()
In [ ]:
data_stats = dataframe.describe()
data_stats = data_stats.transpose()
data_stats
Have a quick look at the joint distribution of a few pairs of columns from the training set:
In [ ]:
import seaborn as sns
sns.pairplot(dataframe[["quality", "citric_acid", "residual_sugar", "alcohol"]], diag_kind="kde")
Did you notice anything when looking at the stats table?
One useful piece of information we can get from this table is, for example, the min and max values of each feature. These show the range each feature falls in.
Based on the description of the dataset and the task we are trying to achieve, do you see any issues with the examples we have available to train on?
Now split the dataset into a training, validation, and test set.
Test sets are used for a final evaluation of the trained model.
There are more sophisticated ways to make sure that your splitting method is repeatable. Ideally, the sets would always be the same after splitting, since non-deterministic splits make experimentation difficult (see the random_state sketch after the next cell).
In [ ]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
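If you want the splits to be repeatable, one simple option is to fix the seed. A minimal sketch using scikit-learn's random_state parameter (the value 42 is an arbitrary choice and is not used elsewhere in this lab):

# Hypothetical sketch: fixing random_state makes the splits identical on every run.
train, test = train_test_split(dataframe, test_size=0.2, random_state=42)
train, val = train_test_split(train, test_size=0.2, random_state=42)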
The tf.data.Dataset API allows for writing descriptive and efficient input pipelines. Dataset usage follows a common pattern:
- Create a source dataset from your input data.
- Apply dataset transformations to preprocess the data.
- Iterate over the dataset and process the elements.
Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.
The df_to_dataset method below creates a dataset object from a pandas dataframe.
In [ ]:
def df_to_dataset(dataframe, epochs=10, shuffle=True, batch_size=64):
    dataframe = dataframe.copy()
    # Extract the label column and one-hot encode it (quality scores 0-10 -> 11 classes).
    labels = tf.keras.utils.to_categorical(dataframe.pop('quality'), num_classes=11)
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.repeat(epochs).batch(batch_size)
    return ds
The next step is to create batched datasets from the train, validation, and test splits we made earlier. We keep the default batch size of 64 defined in df_to_dataset.
In [ ]:
train_ds = df_to_dataset(train)
val_ds = df_to_dataset(val, shuffle=False)
test_ds = df_to_dataset(test, shuffle=False)
Let's look at one batch of the data. The example below prints the contents of a batch: the column names, the elements of the citric_acid column, and the elements of the quality label.
In [ ]:
for feature_batch, label_batch in train_ds.take(1):
    print('Every feature:', list(feature_batch.keys()))
    print('A batch of citric acid:', feature_batch['citric_acid'])
    print('A batch of quality:', label_batch)
TensorFlow provides many types of feature columns. In this exercise, all the feature columns are of type numeric. If there were any text or categorical values, transformations would be needed to make the inputs all numeric.
However, you often don't want to feed a number directly into the model; instead, you may want to split its value into categories based on numerical ranges. To do this, use the bucketized_column method of feature columns, which lets the network represent a discretized dense input bucketed by boundaries.
Feature columns are the object type used to create feature layers, which we will feed to the Keras model.
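For illustration only, here is a minimal sketch of the two kinds of columns described above; the chosen fields and bucket boundaries are arbitrary examples, not the expected answer to the lab task below.

# Hypothetical sketch: a raw numeric column and a bucketized column.
example_citric_acid = tf.feature_column.numeric_column('citric_acid')
example_ph_buckets = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('pH'), boundaries=[3.0, 3.2, 3.4])
example_columns = [example_citric_acid, example_ph_buckets]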
Lab Task # 1: Create a feature column by adding the input fields with the transformations that are needed.
In [ ]:
from tensorflow import feature_column
feature_columns = []
# TODO 1: Create input layer of feature columns
In [ ]:
# Create a feature layer from the feature columns
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
We will be using the Keras Sequential API to create a classification model for wine quality.
The model will be composed of the input layer (the feature_layer created above), a single dense hidden layer with two nodes, and an output layer that lets the model predict the rating (0 - 10) of each example.
When compiling the model, we define a loss function, an optimizer, and which metrics to use to evaluate the model. CategoricalCrossentropy is a type of loss used in classification tasks. Losses are a mathematical way of measuring how wrong the model's predictions are.
Optimizers tie together the loss function and the model parameters by updating the model in response to the output of the loss function. In simpler terms, optimizers shape and mold your model into its most accurate possible form by adjusting the weights. The loss function is the guide to the terrain, telling the optimizer when it is moving in the right or wrong direction. We will use Adam as our optimizer for this exercise. Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on the training data.
There are many types of optimizers to choose from. Ideally, when creating an ML model, try to identify an optimizer that has been empirically adopted on similar tasks.
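As a rough illustration only, and not necessarily the architecture you should build for the lab task below, a Sequential model that starts from the feature layer and ends with an 11-way softmax output could look like this sketch:

# Hypothetical sketch: feature layer -> small hidden layer -> 11-class softmax output.
example_model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(2, activation='relu'),
    layers.Dense(11, activation='softmax')
])

A softmax output like this pairs with the CategoricalCrossentropy(from_logits=False) loss used in the compile step further below.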
Lab Task # 2: Create a deep neural network using Keras's Sequential API. In the cell below, use the tf.keras.layers library to create all the layers for your deep neural network.
In [ ]:
# Build a keras DNN model using Sequential API
model = # TODO2: Define the model.
In [ ]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

model.fit(train_ds,
          validation_data=val_ds,
          epochs=5)
When training a model, you want to evaluate its performance by looking at the loss and the chosen metric(s). The validation loss and accuracy will point out if the model is actually learning and able to generalize or if it is overfitting.
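Assuming the model above has been defined and trained, a final check on the held-out test set could look like the sketch below; the exact numbers will vary from run to run.

# Evaluate the trained model on the test dataset (returns loss and accuracy).
test_loss, test_accuracy = model.evaluate(test_ds)
print('Test accuracy:', test_accuracy)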
This notebook introduced a few concepts for handling a classification problem with the Keras Sequential API.