In [0]:
#@title Copyright 2020 Google LLC. Double-click here for license information.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
After doing this Colab, you'll know how to do the following:

Split a training set into a smaller training set and a validation set.
Analyze deltas between training set and validation set results.
Evaluate the trained model against a test set.
As in the previous exercise, this exercise uses the California Housing dataset to predict the median_house_value at the city block level. Like many "famous" datasets, the California Housing Dataset actually consists of two separate datasets, each living in a separate .csv file:

california_housing_train.csv contains the training set.
california_housing_test.csv contains the test set.

You'll create the validation set by dividing the downloaded training set into two parts: a reduced training set and a validation set.
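If you're curious how such a split could be done by hand, here is a minimal pandas sketch. It is only an illustration; in this Colab, Keras performs the split for you through the validation_split argument, and the 0.2 fraction and random_state below are arbitrary choices.

import pandas as pd

train_df = pd.read_csv(
    "https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv")

# Randomly pick 20% of the rows as a validation set ...
validation_df = train_df.sample(frac=0.2, random_state=42)

# ... and keep the remaining 80% as the reduced training set.
reduced_train_df = train_df.drop(validation_df.index)

print(len(reduced_train_df), len(validation_df))  # 13600 3400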
In [0]:
#@title Run on TensorFlow 2.x
%tensorflow_version 2.x
In [0]:
#@title Import modules
import numpy as np
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt
pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format
In [0]:
train_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv")
test_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv")
In [0]:
scale_factor = 1000.0
# Scale the training set's label.
train_df["median_house_value"] /= scale_factor
# Scale the test set's label.
test_df["median_house_value"] /= scale_factor
The following code cell defines two functions:

build_model, which defines the model's topography.
train_model, which will ultimately train the model, outputting not only the loss value for the training set but also the loss value for the validation set.

Since you don't need to understand model building code right now, we've hidden this code cell. As always, you must run hidden code cells.
In [0]:
#@title Define the functions that build and train a model
def build_model(my_learning_rate):
  """Create and compile a simple linear regression model."""
  # Most simple tf.keras models are sequential.
  model = tf.keras.models.Sequential()

  # Add one linear layer to the model to yield a simple linear regressor.
  model.add(tf.keras.layers.Dense(units=1, input_shape=(1,)))

  # Compile the model topography into code that TensorFlow can efficiently
  # execute. Configure training to minimize the model's mean squared error.
  model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=my_learning_rate),
                loss="mean_squared_error",
                metrics=[tf.keras.metrics.RootMeanSquaredError()])

  return model
def train_model(model, df, feature, label, my_epochs,
                my_batch_size=None, my_validation_split=0.1):
  """Feed a dataset into the model in order to train it."""

  history = model.fit(x=df[feature],
                      y=df[label],
                      batch_size=my_batch_size,
                      epochs=my_epochs,
                      validation_split=my_validation_split)

  # Gather the model's trained weight and bias.
  trained_weight = model.get_weights()[0]
  trained_bias = model.get_weights()[1]

  # The list of epochs is stored separately from the
  # rest of history.
  epochs = history.epoch

  # Isolate the root mean squared error for each epoch.
  hist = pd.DataFrame(history.history)
  rmse = hist["root_mean_squared_error"]

  return epochs, rmse, history.history

print("Defined the build_model and train_model functions.")
In [0]:
#@title Define the plotting function
def plot_the_loss_curve(epochs, mae_training, mae_validation):
  """Plot a curve of loss vs. epoch."""

  plt.figure()
  plt.xlabel("Epoch")
  plt.ylabel("Root Mean Squared Error")

  plt.plot(epochs[1:], mae_training[1:], label="Training Loss")
  plt.plot(epochs[1:], mae_validation[1:], label="Validation Loss")
  plt.legend()

  # We're not going to plot the first epoch, since the loss on the first epoch
  # is often substantially greater than the loss for other epochs.
  merged_mae_lists = mae_training[1:] + mae_validation[1:]
  highest_loss = max(merged_mae_lists)
  lowest_loss = min(merged_mae_lists)
  delta = highest_loss - lowest_loss
  print(delta)

  top_of_y_axis = highest_loss + (delta * 0.05)
  bottom_of_y_axis = lowest_loss - (delta * 0.05)

  plt.ylim([bottom_of_y_axis, top_of_y_axis])
  plt.show()

print("Defined the plot_the_loss_curve function.")
In the following code cell, you'll see a variable named validation_split, which we've initialized at 0.2. The validation_split variable specifies the proportion of the original training set that will serve as the validation set. The original training set contains 17,000 examples. Therefore, a validation_split of 0.2 means that:

3,400 examples (20% of 17,000) will serve as the validation set.
13,600 examples (the remaining 80%) will serve as the reduced training set.
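As a quick sanity check on that arithmetic, you could run something like the following sketch; it assumes the train_df loaded earlier and simply mirrors how a 0.2 split divides the rows.

validation_split = 0.2

num_examples = len(train_df)                           # 17000
num_validation = int(num_examples * validation_split)  # 3400
num_training = num_examples - num_validation           # 13600

print(num_training, num_validation)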
The following code builds a model, trains it on the training set, and evaluates the built model on both:

the reduced training set
the validation set
If the data in the training set is similar to the data in the validation set, then the two loss curves and the final loss values should be almost identical. However, the loss curves and final loss values are not almost identical. Hmm, that's odd.
Experiment with two or three different values of validation_split. Do different values of validation_split fix the problem?
In [0]:
# The following variables are the hyperparameters.
learning_rate = 0.08
epochs = 30
batch_size = 100

# Split the original training set into a reduced training set and a
# validation set.
validation_split = 0.2

# Identify the feature and the label.
my_feature = "median_income"     # the median income on a specific city block.
my_label = "median_house_value"  # the median value of a house on a specific city block.
# That is, you're going to create a model that predicts house value based
# solely on the neighborhood's median income.

# Discard any pre-existing version of the model.
my_model = None

# Invoke the functions to build and train the model.
my_model = build_model(learning_rate)
epochs, rmse, history = train_model(my_model, train_df, my_feature,
                                    my_label, epochs, batch_size,
                                    validation_split)

plot_the_loss_curve(epochs, history["root_mean_squared_error"],
                    history["val_root_mean_squared_error"])
No matter how you split the training set and the validation set, the loss curves differ significantly. Evidently, the data in the training set isn't similar enough to the data in the validation set. Counterintuitive? Yes, but this problem is actually pretty common in machine learning.
Your task is to determine why the loss curves aren't highly similar. As with most issues in machine learning, the problem is rooted in the data itself. To solve this mystery of why the training set and validation set aren't almost identical, write a line or two of pandas code in the following code cell. Here are a couple of hints:

By default, the pandas head method outputs the first 5 rows of the DataFrame. To see more of the training set, specify the n argument to head and assign a large positive integer to n.
In [0]:
# Write some code in this code cell.
In [0]:
#@title Double-click for a possible solution to Task 2.
# Examine the first 1,000 rows of the original training set.
train_df.head(n=1000)
# The original training set is sorted by longitude.
# Apparently, longitude influences the relationship of
# total_rooms to median_house_value.
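If you'd like to confirm that ordering programmatically, a small check such as the one below works; this is an extra verification, not part of the official solution. One of the two printed values should be True if the file really is sorted by longitude in either direction.

# True if the longitude column never decreases (or never increases)
# from one row to the next, i.e. the rows are sorted by longitude.
print(train_df["longitude"].is_monotonic_increasing,
      train_df["longitude"].is_monotonic_decreasing)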
To fix the problem, shuffle the examples in the training set before splitting the examples into a training set and validation set. To do so, take the following steps:

1. Shuffle the data in the original training set by adding the following line before the call to train_model (in the code cell associated with Task 1):

   shuffled_train_df = train_df.reindex(np.random.permutation(train_df.index))

2. Pass shuffled_train_df (instead of train_df) as the second argument to train_model (in the code cell associated with Task 1) so that the call becomes as follows:

   epochs, rmse, history = train_model(my_model, shuffled_train_df, my_feature,
                                       my_label, epochs, batch_size,
                                       validation_split)
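If it helps to see what that reindex call actually does, here is a tiny, self-contained sketch; the toy DataFrame is hypothetical and exists only to make the row shuffling visible.

import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"x": [10, 20, 30, 40]})

# np.random.permutation(toy_df.index) returns the row labels in a random order;
# reindex then reorders the rows to match that permutation.
shuffled_toy_df = toy_df.reindex(np.random.permutation(toy_df.index))

print(shuffled_toy_df)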
In [0]:
#@title Double-click to view the complete implementation.
# The following variables are the hyperparameters.
learning_rate = 0.08
epochs = 70
batch_size = 100

# Split the original training set into a reduced training set and a
# validation set.
validation_split = 0.2

# Identify the feature and the label.
my_feature = "median_income"     # the median income on a specific city block.
my_label = "median_house_value"  # the median value of a house on a specific city block.
# That is, you're going to create a model that predicts house value based
# solely on the neighborhood's median income.

# Discard any pre-existing version of the model.
my_model = None

# Shuffle the examples.
shuffled_train_df = train_df.reindex(np.random.permutation(train_df.index))

# Invoke the functions to build and train the model. Train on the shuffled
# training set.
my_model = build_model(learning_rate)
epochs, rmse, history = train_model(my_model, shuffled_train_df, my_feature,
                                    my_label, epochs, batch_size,
                                    validation_split)

plot_the_loss_curve(epochs, history["root_mean_squared_error"],
                    history["val_root_mean_squared_error"])
Experiment with validation_split to answer the following questions:

After shuffling the original training set, do the final loss values for the training set and validation set become closer?
Below approximately what value of validation_split do the final loss values for the training set and validation set diverge meaningfully? Why?
In [0]:
#@title Double-click for the answers to the questions
# Yes, after shuffling the original training set,
# the final loss for the training set and the
# validation set become much closer.
# If validation_split < 0.15,
# the final loss values for the training set and
# validation set diverge meaningfully. Apparently,
# the validation set no longer contains enough examples.
The test set usually acts as the ultimate judge of a model's quality. The test set can serve as an impartial judge because its examples haven't been used in training the model. Run the following code cell to evaluate the model with the test set:
In [0]:
x_test = test_df[my_feature]
y_test = test_df[my_label]
results = my_model.evaluate(x_test, y_test, batch_size=batch_size)
Compare the root mean squared error of the model when evaluated on each of the three datasets:

Training set: root_mean_squared_error in the final training epoch.
Validation set: val_root_mean_squared_error in the final training epoch.
Test set: the root_mean_squared_error reported by the evaluation in the preceding code cell.

Ideally, the root mean squared error of all three sets should be similar. Are they?
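One convenient way to put the three numbers side by side is a sketch like the following; it assumes the history and results variables from the cells above, and that results lists the compiled RootMeanSquaredError metric after the loss.

train_rmse = history["root_mean_squared_error"][-1]
val_rmse = history["val_root_mean_squared_error"][-1]
test_rmse = results[1]   # results[0] is the loss (MSE); results[1] is the RMSE metric.

print(f"Training RMSE:   {train_rmse:.3f}")
print(f"Validation RMSE: {val_rmse:.3f}")
print(f"Test RMSE:       {test_rmse:.3f}")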
In [0]:
#@title Double-click for an answer
# In our experiments, yes, the rmse values
# were similar enough.