In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Case Study: Debugging in Regression

In this Colab, you will learn how to debug a regression problem through a case study. You will:

  1. Set up the problem.
  2. Interpret the correlation matrix.
  3. Implement linear and nonlinear models.
  4. Compare and choose between linear and nonlinear models.
  5. Optimize your chosen model.
  6. Debug your chosen model.

Setup

This Colab uses the wine quality dataset[1], which is hosted at UCI. This dataset contains data on the physicochemical properties of wine along with wine quality ratings. The problem is to predict wine quality (0-10) from physicochemical properties.

Please make a copy of this Colab before running it. Click on File, and then click on Save a copy in Drive.

[1] Modeling wine preferences by data mining from physicochemical properties. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Load libraries and data by running the next cell. Display the first few rows to verify that the dataset loaded correctly.


In [0]:
# Reset environment for a new run
%reset -f

# Load libraries
from os.path import join # for joining file pathnames
import pandas as pd
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt

# Set Pandas display options
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

wineDf = pd.read_csv(
  "https://download.mlcc.google.com/mledu-datasets/winequality.csv",
  encoding='latin-1')
wineDf.columns = ['fixed acidity','volatile acidity','citric acid',
                     'residual sugar','chlorides','free sulfur dioxide',
                     'total sulfur dioxide','density','pH',
                     'sulphates','alcohol','quality']
wineDf.head()

Check Correlation Matrix

Before developing your ML model, you need to select features. To find informative features, check the correlation matrix by running the following cell. Which features are informative?


In [0]:
corr_wineDf = wineDf.corr()
plt.figure(figsize=(16,10))
sns.heatmap(corr_wineDf, annot=True)

alcohol is most highly correlated with quality. Looking for other informative features, notice that volatile acidity correlates with quality but not with alcohol, making it a good second feature. Remember that a correlation matrix is not helpful if predictive signals are encoded in combinations of features.
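
If you prefer a ranked list to the heatmap, you can sort the absolute correlations with quality directly. This is a convenience sketch; it surfaces the same information as the heatmap above.


In [0]:
# Rank features by the absolute value of their correlation with quality.
print(corr_wineDf['quality'].abs().sort_values(ascending=False))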

Validate Input Data against Data Schema

Before processing your data, you should validate the data against a data schema as described in Data and Feature Debugging.

First, define a function that validates data against a schema.


In [0]:
#@title Define function to validate data

def test_data_schema(input_data, schema):
  """Tests that the datatypes and ranges of values in the dataset
    adhere to expectations.

    Args:
      input_data: Dataframe containing data to test.
      schema: Schema which describes the properties of the data.
  """

  def test_dtypes():
    for column in schema.keys():
      assert input_data[column].map(type).eq(
          schema[column]['dtype']).all(), (
          "Incorrect dtype in column '%s'." % column
      )
    print('Input dtypes are correct.')

  def test_ranges():
    for column in schema.keys():
      schema_max = schema[column]['range']['max']
      schema_min = schema[column]['range']['min']
      # Assert that data falls between schema min and max.
      assert input_data[column].max() <= schema_max, (
          "Maximum value of column '%s' is too high." % column
      )
      assert input_data[column].min() >= schema_min, (
          "Minimum value of column '%s' is too low." % column
      )
    print('Data falls within specified ranges.')

  test_dtypes()
  test_ranges()

To define your schema, you need to understand the statistical properties of your dataset. Generate statistics on your dataset by running the following code cell:


In [0]:
wineDf.describe()

Using the statistics generated above, define the data schema in the following code cell. For demonstration purposes, restrict your data schema to the first three data columns. For each data column, enter the:

  • minimum value
  • maximum value
  • data type

As an example, the values for the first column are filled out. After entering the values, run the code cell to confirm that your input data matches the schema.


In [0]:
wine_schema = {
    'fixed acidity': {
        'range': {
            'min': 3.8,
            'max': 15.9
        },
        'dtype': float,
    },
    'volatile acidity': {
        'range': {
            'min': , # describe() rounds up this value, be careful
            'max':
        },
        'dtype': ,
    },
    'citric acid': {
        'range': {
            'min': ,
            'max':
        },
        'dtype': ,
    }
}

print('Validating wine data against data schema...')
test_data_schema(wineDf, wine_schema)

Solution


In [0]:
wine_schema = {
    'fixed acidity': {
        'range': {
            'min': 3.7,
            'max': 15.9
        },
        'dtype': float,
    },
    'volatile acidity': {
        'range': {
            'min': 0.08,  # minimum value
            'max': 1.6   # maximum value
        },
        'dtype': float,    # data type
    },
    'citric acid': {
        'range': {
            'min': 0.0, # minimum value
            'max': 1.7  # maximum value
        },
        'dtype': float,   # data type
    }
}

print('Validating wine data against data schema...')
test_data_schema(wineDf, wine_schema)

Split and Normalize Data

Split the dataset into data and labels.


In [0]:
wineFeatures = wineDf.copy(deep=True)
wineFeatures.drop(columns='quality',inplace=True)
wineLabels = wineDf['quality'].copy(deep=True)

Normalize data using z-score.


In [0]:
def normalizeData(arr):
  stdArr = np.std(arr)
  meanArr = np.mean(arr)
  arr = (arr-meanArr)/stdArr
  return arr

for column in wineFeatures.columns:
  wineFeatures[column] = normalizeData(wineFeatures[column])
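
As a quick sanity check (an extra step, not part of the tests in the next section), you can confirm that each normalized column now has a mean of approximately 0 and a standard deviation of approximately 1:


In [0]:
# After z-score normalization, every feature column should have
# mean ~0.0 and standard deviation ~1.0.
print(wineFeatures.mean())
print(wineFeatures.std())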

Test Engineered Data

After normalizing your data, you should test your engineered data for errors as described in Data and Feature Debugging. In this section, you will test that the engineered data:

  • Has the expected number of rows and columns.
  • Does not have null values.

First, set up the testing functions by running the following code cell:


In [0]:
def test_input_dim(df, n_rows, n_columns):
  assert len(df) == n_rows, "Unexpected number of rows."
  assert len(df.columns) == n_columns, "Unexpected number of columns."
  print('Engineered data has the expected number of rows and columns.')

def test_nulls(df):
  dataNulls = df.isnull().sum().sum()
  assert dataNulls == 0, "Nulls in engineered data."
  print('Engineered features do not contain nulls.')

Your input data had 6497 examples and 11 feature columns. Test whether your engineered data has the expected number of rows and columns by running the following cell. Confirm that the test fails if you change the values below.


In [0]:
#@title Test dimensions of engineered data
wine_feature_rows = 6497 #@param
wine_feature_cols = 11 #@param
test_input_dim(wineFeatures,
               wine_feature_rows,
               wine_feature_cols)

Test that your engineered data does not contain nulls by running the code below.


In [0]:
test_nulls(wineFeatures)

Check Splits for Statistical Equivalence

As described in the Data Debugging guidelines, before developing your model, you should check that your training and validation splits are equally representative. Assuming a training:validation split of 80:20, compare the mean and the standard deviation of the splits by running the next two code cells. Note that this comparison is not a rigorous test for statistical equivalence but simply a quick and dirty comparison of the splits.


In [0]:
splitIdx = wineFeatures.shape[0] * 8 // 10
wineFeatures.iloc[0:splitIdx,:].describe()

In [0]:
wineFeatures.iloc[splitIdx:,:].describe()

The two splits are clearly not equally representative. To make the splits equally representative, you can shuffle the data.

Run the following code cell to shuffle the data, and then recreate the features and labels from the shuffled data.


In [0]:
# Shuffle data
wineDf = wineDf.sample(frac=1).reset_index(drop=True)
# Recreate features and labels
wineFeatures = wineDf.copy(deep=True)
wineFeatures.drop(columns='quality',inplace=True)
wineLabels = wineDf['quality'].copy(deep=True)

Now, confirm that the splits are equally representative by regenerating and comparing the statistics using the previous code cells. You may wonder why the initial splits differed so greatly. It turns out that in the wine dataset, the first 4898 rows contain data on white wines and the remaining 1599 rows contain data on red wines. When you split your dataset 80:20, your training dataset contains 5197 examples, which is about 94% white wine. The validation dataset is purely red wine.
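
You can verify this arithmetic directly; the row counts below are taken from the description above.


In [0]:
# Quick check of the 80:20 split arithmetic using the row counts above.
n_white, n_red = 4898, 1599             # white wines come first in the raw file
n_train = (n_white + n_red) * 8 // 10   # 5197 training examples
print(n_train)                    # 5197
print(float(n_white) / n_train)   # ~0.94: training split is ~94% white wine
print(n_white + n_red - n_train)  # 1300 validation examples, all red wine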

Ensuring your splits are statistically equivalent is a good development practice. In general, following good development practices will simplify your model debugging. To learn about testing for statistical equivalence, see Equivalence Tests (Lakens, D.).
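
As a lightweight programmatic alternative to eyeballing the describe() tables (a rough sketch, not the equivalence-testing procedure cited above), you can flag features whose split means differ noticeably:


In [0]:
# Rough check: flag features whose training and validation means differ by
# more than 0.1 training-split standard deviations. Not a rigorous test.
trainSplit = wineFeatures.iloc[0:splitIdx, :]
validationSplit = wineFeatures.iloc[splitIdx:, :]
for column in wineFeatures.columns:
  gap = abs(trainSplit[column].mean() - validationSplit[column].mean())
  if gap > 0.1 * trainSplit[column].std():
    print("Possible mismatch in '%s': mean gap = %.2f" % (column, gap))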

Establish a Baseline

For a regression problem, the simplest baseline is to predict the average value. Run the following code to calculate the mean-squared error (MSE) loss on the training split using the average value as a baseline. Your loss is approximately 0.75. Any model should beat this loss to justify its use.


In [0]:
baselineMSE = np.square(wineLabels[0:splitIdx]-np.mean(wineLabels[0:splitIdx]))
baselineMSE = np.sum(baselineMSE)/len(baselineMSE)
print(baselineMSE)

Linear Model

Following good ML dev practice, let's start with a linear model that uses the most informative feature from the correlation matrix: alcohol. Even if this model performs badly, we can still use it as a baseline. This model should beat our previous baseline's MSE of 0.75.

First, let's define a function to plot our loss curves and print the final training and validation losses. Calling this function after training is an alternative to watching per-epoch output with verbose=1.


In [0]:
def showRegressionResults(trainHistory):
  """Function to:
   * Print final loss.
   * Plot loss curves.
  
  Args:
    trainHistory: object returned by model.fit
  """
  
  # Print final loss
  print("Final training loss: " + str(trainHistory.history['loss'][-1]))
  print("Final Validation loss: " + str(trainHistory.history['val_loss'][-1]))
  
  # Plot loss curves
  plt.plot(trainHistory.history['loss'])
  plt.plot(trainHistory.history['val_loss'])
  plt.legend(['Training loss','Validation loss'],loc='best')
  plt.title('Loss Curves')

For fast prototyping, let's train on the full batch each epoch, so that the weights update only once per epoch. Use the full batch by setting batch_size = wineFeatures.shape[0] as indicated by the code comment.

What do you think of the loss curve? Can you improve it? For hints and discussion, see the following text cells.


In [0]:
model = None
# Choose feature
wineFeaturesSimple = wineFeatures['alcohol']
# Define model
model = keras.Sequential()
model.add(keras.layers.Dense(units=1, activation='linear', input_dim=1))
# Specify the optimizer using the TF API to specify the learning rate
model.compile(optimizer=tf.train.AdamOptimizer(learning_rate=0.01),
              loss='mse')
# Train the model!
trainHistory = model.fit(wineFeaturesSimple,
                         wineLabels,
                         epochs=50,
                         batch_size=, # set batch size here
                         validation_split=0.2,
                         verbose=0)
# Plot
showRegressionResults(trainHistory)

Hint

The loss decreases but very slowly. Possible fixes are:

  • Increase number of epochs.
  • Increase learning rate.
  • Decrease batch size. A smaller batch size can produce a larger decrease in loss per epoch, under the assumption that the smaller batches stay representative of the overall data distribution.

Play with these three parameters in the code above to decrease the loss.

Solution

Run the following code cell to train the model using a reduced batch size of 100. Reducing the batch size leads to a greater decrease in loss per epoch. The minimum achievable loss is about 0.64, a significant improvement over our baseline of 0.75.


In [0]:
model = None
# Choose feature
wineFeaturesSimple = wineFeatures['alcohol']
# Define model
model = keras.Sequential()
model.add(keras.layers.Dense(units=1, activation='linear', input_dim=1))
# Specify the optimizer using the TF API to specify the learning rate
model.compile(optimizer=tf.train.AdamOptimizer(learning_rate=0.01),
              loss='mse')
# Train the model!
trainHistory = model.fit(wineFeaturesSimple,
                         wineLabels,
                         epochs=20,
                         batch_size=100, # set batch size here
                         validation_split=0.2,
                         verbose=0)
# Plot
showRegressionResults(trainHistory)

Add Feature to Linear Model

Try adding a feature to the linear model. Since you need to combine the two features into one prediction for regression, you'll also need to add a second layer. Modify the code below to implement the following changes:

  1. Add 'volatile acidity' to the features in wineFeaturesSimple.
  2. Add a second linear layer with 1 unit.
  3. Experiment with learning rate, epochs, and batch_size to try to reduce loss.

What happens to your loss?


In [0]:
model = None
# Select features
wineFeaturesSimple = wineFeatures[['alcohol', '...']] # add 'volatile acidity'
# Define model
model = keras.Sequential()
model.add(keras.layers.Dense(wineFeaturesSimple.shape[1],
                             input_dim=wineFeaturesSimple.shape[1],
                             activation='linear'))
model.add(...) # add second layer
# Compile
model.compile(optimizer=tf.train.AdamOptimizer(learning_rate=), loss='mse')
# Train
trainHistory = model.fit(wineFeaturesSimple,
                         wineLabels,
                         epochs=,
                         batch_size=,
                         validation_split=0.2,
                         verbose=0)
# Plot results
showRegressionResults(trainHistory)

Solution

Run the following code to add the second feature and the second layer. The training loss is about 0.59, a small decrease from the previous loss of 0.64.


In [0]:
model = None
# Select features
wineFeaturesSimple = wineFeatures[['alcohol', 'volatile acidity']] # add second feature
# Define model
model = keras.Sequential()
model.add(keras.layers.Dense(wineFeaturesSimple.shape[1],
                             input_dim=wineFeaturesSimple.shape[1],
                             activation='linear'))
model.add(keras.layers.Dense(1, activation='linear')) # add second layer
# Compile
model.compile(optimizer=tf.train.AdamOptimizer(learning_rate=0.01), loss='mse')
# Train
trainHistory = model.fit(wineFeaturesSimple,
                         wineLabels,
                         epochs=20,
                         batch_size=100,
                         validation_split=0.2,
                         verbose=0)
# Plot results
showRegressionResults(trainHistory)

Use a Nonlinear Model

Let's try a nonlinear model. Modify the code below to make the following changes:

  1. Change the first layer to use relu. (Output layer stays linear since this is a regression problem.)
  2. As usual, specify the learning rate, epochs, and batch_size.

Run the cell. Does the loss increase, decrease, or stay the same?


In [0]:
model = None
# Define
model = keras.Sequential()
model.add(keras.layers.Dense(wineFeaturesSimple.shape[1],
                             input_dim=wineFeaturesSimple.shape[1],
                             activation=))
model.add(keras.layers.Dense(1, activation='linear'))
# Compile
model.compile(optimizer=tf.train.AdamOptimizer(), loss='mse')
# Fit
trainHistory = model.fit(wineFeaturesSimple,
                         wineLabels,
                         epochs=,
                         batch_size=,
                         validation_split=0.2,
                         verbose=0)
# Plot results
showRegressionResults(trainHistory)

Solution

Run the following cell to use a relu activation in your first hidden layer. Your loss stays about the same, perhaps declining negligibly to 0.58.


In [0]:
model = None
# Define
model = keras.Sequential()
model.add(keras.layers.Dense(wineFeaturesSimple.shape[1],
                             input_dim=wineFeaturesSimple.shape[1],
                             activation='relu'))
model.add(keras.layers.Dense(1, activation='linear'))
# Compile
model.compile(optimizer=tf.train.AdamOptimizer(), loss='mse')
# Fit
trainHistory = model.fit(wineFeaturesSimple,
                         wineLabels,
                         epochs=20,
                         batch_size=100,
                         validation_split=0.2,
                         verbose=0)
# Plot results
showRegressionResults(trainHistory)

Optimize Your Model

We added a second feature and a hidden layer, but the loss barely improved. At this point, it's tempting to throw all your features at a high-capacity network. However, you must resist the temptation. Instead, follow the guidance in Model Optimization to improve model performance. For a hint and a discussion, see the following text sections.


In [0]:
# Choose features
wineFeaturesSimple = wineFeatures[['alcohol', 'volatile acidity']] # add features
# Define
model = None
model = keras.Sequential()
model.add(keras.layers.Dense(wineFeaturesSimple.shape[1],
                             activation='relu',
                             input_dim=wineFeaturesSimple.shape[1]))
# Add more layers here
model.add(keras.layers.Dense(1,activation='linear'))
# Compile
model.compile(optimizer=tf.train.AdamOptimizer(), loss='mse')
# Train
trainHistory = model.fit(wineFeaturesSimple,
                         wineLabels,
                         epochs=,
                         batch_size=,
                         validation_split=0.2,
                         verbose=0)
# Plot results
showRegressionResults(trainHistory)

Hint

You can try to reduce loss by adding features, adding layers, or playing with the hyperparameters. Before adding more features, check the correlation matrix. Don't expect your loss to decrease by much. Sadly, that is a common experience in machine learning!

Solution

Run the following code to:

  • Add the features chlorides and density.
  • Set training epochs to 200.
  • Set batch size to 100.

Your loss reduces to about 0.56. That's a minor improvement over the previous loss of 0.58. It seems that adding more features or capacity isn't improving your model by much. Perhaps your model has a bug? In the next section, you will run a sanity check on your model.


In [0]:
# Choose features
wineFeaturesSimple = wineFeatures[['alcohol','volatile acidity','chlorides','density']]
# Define
model = None
model = keras.Sequential()
model.add(keras.layers.Dense(wineFeaturesSimple.shape[1],
                             activation='relu',
                             input_dim=wineFeaturesSimple.shape[1]))
# Add more layers here
model.add(keras.layers.Dense(1,activation='linear'))
# Compile
model.compile(optimizer=tf.train.AdamOptimizer(), loss='mse')
# Train
trainHistory = model.fit(wineFeaturesSimple,
                         wineLabels,
                         epochs=200,
                         batch_size=100,
                         validation_split=0.2,
                         verbose=0)
# Plot results
showRegressionResults(trainHistory)

Check for Implementation Bugs using Reduced Dataset

Your loss isn't decreasing by much. Perhaps your model has an implementation bug. From the Model Debugging guidelines, a quick test for implementation bugs is to obtain a low loss on a reduced dataset of, say, 10 examples. Remember, passing this test does not validate your modeling approach; it only checks for basic implementation bugs. If your model passes this test, return to training and debugging your model on the full dataset.

In the following code, experiment with the learning rate, batch size, and number of epochs. Can you reach a low loss? Choose hyperparameter values that let you iterate quickly.


In [0]:
# Choose 10 examples
wineFeaturesSmall = wineFeatures[0:10]
wineLabelsSmall = wineLabels[0:10]
# Define model
model = None
model = keras.Sequential()
model.add(keras.layers.Dense(wineFeaturesSmall.shape[1],
                             activation='relu',
                             input_dim=wineFeaturesSmall.shape[1]))
model.add(keras.layers.Dense(wineFeaturesSmall.shape[1], activation='relu'))
model.add(keras.layers.Dense(1, activation='linear'))
# Compile
model.compile(optimizer=tf.train.AdamOptimizer(), loss='mse') # set LR
# Train
trainHistory = model.fit(wineFeaturesSmall,
                        wineLabelsSmall,
                        epochs=,
                        batch_size=,
                        verbose=0)
# Plot results
print("Final training loss: " + str(trainHistory.history['loss'][-1]))
plt.plot(trainHistory.history['loss'])

Solution

Run the following code cell to train the model using these hyperparameter values:

  • learning rate = 0.01
  • epochs = 200
  • batch size = 10

You get a low loss on your reduced dataset. This result means your model is probably free of basic implementation bugs, and your previous results are likely as good as this modeling approach will get.


In [0]:
# Choose 10 examples
wineFeaturesSmall = wineFeatures[0:10]
wineLabelsSmall = wineLabels[0:10]
# Define model
model = None
model = keras.Sequential()
model.add(keras.layers.Dense(wineFeaturesSmall.shape[1], activation='relu',
                             input_dim=wineFeaturesSmall.shape[1]))
model.add(keras.layers.Dense(wineFeaturesSmall.shape[1], activation='relu'))
model.add(keras.layers.Dense(1, activation='linear'))
# Compile
model.compile(optimizer=tf.train.AdamOptimizer(0.01), loss='mse') # set LR
# Train
trainHistory = model.fit(wineFeaturesSmall,
                        wineLabelsSmall,
                        epochs=200,
                        batch_size=10,
                        verbose=0)
# Plot results
print("Final training loss: " + str(trainHistory.history['loss'][-1]))
plt.plot(trainHistory.history['loss'])

Trying a Very Complex Model

Let's go all in and use a very complex model with all the features. For science! And to satisfy ourselves that a simple model is indeed better. Let's use all 11 features with 3 fully-connected relu layers and a final linear layer. The next cell takes a while to run. Skip ahead to the results discussion if you like.


In [0]:
model = None
# Define
model = keras.Sequential()
model.add(keras.layers.Dense(wineFeatures.shape[1], activation='relu',
                             input_dim=wineFeatures.shape[1]))
model.add(keras.layers.Dense(wineFeatures.shape[1], activation='relu'))
model.add(keras.layers.Dense(wineFeatures.shape[1], activation='relu'))
model.add(keras.layers.Dense(1,activation='linear'))
# Compile
model.compile(optimizer=tf.train.AdamOptimizer(), loss='mse')
# Train the model!
trainHistory = model.fit(wineFeatures, wineLabels, epochs=100, batch_size=100,
                         verbose=1, validation_split = 0.2)
# Plot results
showRegressionResults(trainHistory)
plt.ylim(0.4,1)

Results

If you train for long enough, the minimum achievable MSE is around 0.52, which is a decrease of 0.04 from the previous best loss of 0.56. This decrease probably isn't worth the performance and complexity cost of using all your features and a deeper network. However, that decision depends on the objectives you're optimizing for.

If you train the model long enough, the validation loss starts increasing while the training loss continues to decrease. This divergence in the loss curves means your model is overfitting: the very complex model has enough capacity to fit noise in the training data. Stick with the simpler model. You'll be happier and live longer!
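
If you do keep experimenting with the complex model, one standard guard against this overfitting (a suggestion beyond the original exercise) is the Keras EarlyStopping callback, which halts training once the validation loss stops improving:


In [0]:
# Sketch: stop training when validation loss hasn't improved for 5 epochs,
# then report the final losses with the helper defined earlier.
earlyStopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
trainHistory = model.fit(wineFeatures, wineLabels, epochs=100, batch_size=100,
                         verbose=0, validation_split=0.2,
                         callbacks=[earlyStopping])
showRegressionResults(trainHistory)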

If you do want to optimize your loss, then play with the model to find the minimum achievable training loss before overfitting sets in. Try playing with the network parameters in the code cell above to achieve a loss of 0.51. But be warned: optimizing your loss could take a lot of trial and error.

Conclusion

This Colab demonstrated the following principles:

  • The most important step in machine learning is understanding your data.
  • The largest gains come from the initial features and network.
  • Returns diminish as you add features and complexity.
  • Incremental development provides confidence in model quality and allows benchmarking against previous results.
  • Reproducing previous results is extremely important. Hence, always use version control.