In [0]:
#@title Copyright 2020 Google LLC. Double-click for license information.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
After doing this Colab, you'll know how to:

- Use tf.feature_column methods to represent features in different ways.

Like several of the previous Colabs, this exercise uses the California Housing Dataset.
In [0]:
%tensorflow_version 2.x
In [0]:
#@title Load the imports
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers
from matplotlib import pyplot as plt
# The following lines adjust the granularity of reporting.
pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format
tf.keras.backend.set_floatx('float32')
print("Imported the modules.")
The following code cell loads the separate .csv files and creates the following two pandas DataFrames:

- train_df, which contains the training set
- test_df, which contains the test set

The code cell then scales median_house_value to a more human-friendly range and then shuffles the examples.
In [0]:
# Load the dataset
train_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv")
test_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv")
# Scale the labels
scale_factor = 1000.0
# Scale the training set's label.
train_df["median_house_value"] /= scale_factor
# Scale the test set's label
test_df["median_house_value"] /= scale_factor
# Shuffle the examples
train_df = train_df.reindex(np.random.permutation(train_df.index))
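As a quick sanity check (not part of the original exercise), you could confirm that the scaled label is now expressed in thousands of dollars:
In [0]:
# Optional check: after dividing by 1000, median_house_value should fall
# roughly in the tens to hundreds (thousands of dollars).
print(train_df["median_house_value"].describe())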
Previous Colabs trained on only a single feature or a single synthetic feature. By contrast, this exercise trains on two features. Furthermore, this Colab introduces feature columns, which provide a sophisticated way to represent features.
You create a feature column as follows: call a tf.feature_column method to represent a single feature, single feature cross, or single synthetic feature in the desired way. For example, to represent a certain feature as floating-point values, call tf.feature_column.numeric_column. To represent a certain feature as a series of buckets or bins, call tf.feature_column.bucketized_column.

A neighborhood's location is typically the most important feature in determining a house's value. The California Housing dataset provides two features, latitude and longitude, that identify each neighborhood's location.
The following code cell calls tf.feature_column.numeric_column twice, first to represent latitude as a floating-point value and a second time to represent longitude as a floating-point value.

This code cell specifies the features that you'll ultimately train the model on and how each of those features will be represented. The transformations (collected in fp_feature_layer) don't actually get applied until you pass a DataFrame to the layer, which will happen when you train the model.
In [0]:
# Create an empty list that will eventually hold all feature columns.
feature_columns = []
# Create a numerical feature column to represent latitude.
latitude = tf.feature_column.numeric_column("latitude")
feature_columns.append(latitude)
# Create a numerical feature column to represent longitude.
longitude = tf.feature_column.numeric_column("longitude")
feature_columns.append(longitude)
# Convert the list of feature columns into a layer that will ultimately become
# part of the model. Understanding layers is not important right now.
fp_feature_layer = layers.DenseFeatures(feature_columns)
When used, the layer processes the raw inputs, according to the transformations described by the feature columns, and packs the result into a numeric array. (The model will train on this numeric array.)
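If you want to see that packing happen, you could pass a small batch of raw values directly to the layer. The following is a minimal sketch; the coordinate values are made up for illustration:
In [0]:
# A minimal sketch: apply the feature layer to a hypothetical batch of
# two examples. The coordinates below are illustrative, not from the dataset.
sample_batch = {
    "latitude": np.array([34.05, 37.77]),
    "longitude": np.array([-118.24, -122.42]),
}
# The layer returns a float tensor with one row per example and one
# column per feature column.
print(fp_feature_layer(sample_batch))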
The following code defines three functions:

- create_model, which tells TensorFlow to build a linear regression model and to use the feature layer passed to it (here, fp_feature_layer) as the representation of the model's features.
- train_model, which will ultimately train the model from training set examples.
- plot_the_loss_curve, which generates a loss curve.
In [0]:
#@title Define functions to create and train a model, and a plotting function
def create_model(my_learning_rate, feature_layer):
"""Create and compile a simple linear regression model."""
# Most simple tf.keras models are sequential.
model = tf.keras.models.Sequential()
# Add the layer containing the feature columns to the model.
model.add(feature_layer)
# Add one linear layer to the model to yield a simple linear regressor.
model.add(tf.keras.layers.Dense(units=1, input_shape=(1,)))
# Construct the layers into a model that TensorFlow can execute.
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=my_learning_rate),
loss="mean_squared_error",
metrics=[tf.keras.metrics.RootMeanSquaredError()])
return model
def train_model(model, dataset, epochs, batch_size, label_name):
"""Feed a dataset into the model in order to train it."""
features = {name:np.array(value) for name, value in dataset.items()}
label = np.array(features.pop(label_name))
history = model.fit(x=features, y=label, batch_size=batch_size,
epochs=epochs, shuffle=True)
# The list of epochs is stored separately from the rest of history.
epochs = history.epoch
# Isolate the root mean squared error for each epoch.
hist = pd.DataFrame(history.history)
rmse = hist["root_mean_squared_error"]
return epochs, rmse
def plot_the_loss_curve(epochs, rmse):
"""Plot a curve of loss vs. epoch."""
plt.figure()
plt.xlabel("Epoch")
plt.ylabel("Root Mean Squared Error")
plt.plot(epochs, rmse, label="Loss")
plt.legend()
plt.ylim([rmse.min()*0.94, rmse.max()* 1.05])
plt.show()
print("Defined the create_model, train_model, and plot_the_loss_curve functions.")
In [0]:
# The following variables are the hyperparameters.
learning_rate = 0.05
epochs = 30
batch_size = 100
label_name = 'median_house_value'
# Create and compile the model's topography.
my_model = create_model(learning_rate, fp_feature_layer)
# Train the model on the training set.
epochs, rmse = train_model(my_model, train_df, epochs, batch_size, label_name)
plot_the_loss_curve(epochs, rmse)
print("\n: Evaluate the new model against the test set:")
test_features = {name:np.array(value) for name, value in test_df.items()}
test_label = np.array(test_features.pop(label_name))
my_model.evaluate(x=test_features, y=test_label, batch_size=batch_size)
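Beyond the aggregate RMSE, you could also spot-check a few individual predictions. This sketch reuses test_features and test_label from the cell above; the slice size of 3 is arbitrary:
In [0]:
# Spot-check: predict median house value (in thousands of dollars) for
# the first three test examples and compare against the true labels.
few_examples = {name: values[:3] for name, values in test_features.items()}
print("Predicted:", my_model.predict(few_examples).flatten())
print("Actual:   ", test_label[:3])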
In [0]:
#@title Double-click to view an answer to Task 1.
# No. Representing latitude and longitude as
# floating-point values does not have much
# predictive power. For example, neighborhoods at
# latitude 35 are not 36/35 more valuable
# (or 35/36 less valuable) than houses at
# latitude 36.
# Representing `latitude` and `longitude` as
# floating-point values provides almost no
# predictive power. We're only using the raw values
# to establish a baseline for future experiments
# with better representations.
The following code cell represents latitude and longitude in buckets (bins). Each bin represents all the neighborhoods within a single degree. For example, neighborhoods at latitude 35.4 and 35.8 are in the same bucket, but neighborhoods at latitude 35.4 and 36.2 are in different buckets.

The model will learn a separate weight for each bucket. For example, the model will learn one weight for all the neighborhoods in the "35" bin, a different weight for neighborhoods in the "36" bin, and so on. This representation will create approximately 20 buckets:

- 10 buckets for latitude.
- 10 buckets for longitude.
In [0]:
resolution_in_degrees = 1.0
# Create a new empty list that will eventually hold the generated feature columns.
feature_columns = []
# Create a bucket feature column for latitude.
latitude_as_a_numeric_column = tf.feature_column.numeric_column("latitude")
latitude_boundaries = list(np.arange(int(min(train_df['latitude'])),
int(max(train_df['latitude'])),
resolution_in_degrees))
latitude = tf.feature_column.bucketized_column(latitude_as_a_numeric_column,
latitude_boundaries)
feature_columns.append(latitude)
# Create a bucket feature column for longitude.
longitude_as_a_numeric_column = tf.feature_column.numeric_column("longitude")
longitude_boundaries = list(np.arange(int(min(train_df['longitude'])),
int(max(train_df['longitude'])),
resolution_in_degrees))
longitude = tf.feature_column.bucketized_column(longitude_as_a_numeric_column,
longitude_boundaries)
feature_columns.append(longitude)
# Convert the list of feature columns into a layer that will ultimately become
# part of the model. Understanding layers is not important right now.
buckets_feature_layer = layers.DenseFeatures(feature_columns)
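To see what the bucketized representation looks like, you could feed a single example through this new layer; each coordinate comes back as a one-hot vector with a 1 in its bucket. A minimal sketch with illustrative coordinates:
In [0]:
# A minimal sketch: one example passed through the bucketized layer.
# Each feature becomes a one-hot vector whose length is the number of
# buckets for that feature, so the output concatenates two one-hot vectors.
one_example = {"latitude": np.array([35.4]), "longitude": np.array([-119.7])}
print(buckets_feature_layer(one_example))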
In [0]:
# The following variables are the hyperparameters.
learning_rate = 0.04
epochs = 35
# Build the model, this time passing in the buckets_feature_layer.
my_model = create_model(learning_rate, buckets_feature_layer)
# Train the model on the training set.
epochs, rmse = train_model(my_model, train_df, epochs, batch_size, label_name)
plot_the_loss_curve(epochs, rmse)
print("\n: Evaluate the new model against the test set:")
my_model.evaluate(x=test_features, y=test_label, batch_size=batch_size)
In [0]:
#@title Double-click for an answer to Task 2.
# Bucket representation outperformed
# floating-point representations.
# However, you can still do far better.
In [0]:
#@title Double-click to view an answer to Task 3.
# Representing location as a feature cross should
# produce better results.
# In Task 2, you represented latitude in
# one-dimensional buckets and longitude in
# another series of one-dimensional buckets.
# Real-world locations, however, exist in
# two dimensions. Therefore, you should
# represent location as a two-dimensional feature
# cross. That is, you'll cross the 10 or so latitude
# buckets with the 10 or so longitude buckets to
# create a grid of 100 cells.
# The model will learn separate weights for each
# of the cells.
In [0]:
resolution_in_degrees = 1.0
# Create a new empty list that will eventually hold the generated feature column.
feature_columns = []
# Create a bucket feature column for latitude.
latitude_as_a_numeric_column = tf.feature_column.numeric_column("latitude")
latitude_boundaries = list(np.arange(int(min(train_df['latitude'])), int(max(train_df['latitude'])), resolution_in_degrees))
latitude = tf.feature_column.bucketized_column(latitude_as_a_numeric_column, latitude_boundaries)
# Create a bucket feature column for longitude.
longitude_as_a_numeric_column = tf.feature_column.numeric_column("longitude")
longitude_boundaries = list(np.arange(int(min(train_df['longitude'])), int(max(train_df['longitude'])), resolution_in_degrees))
longitude = tf.feature_column.bucketized_column(longitude_as_a_numeric_column, longitude_boundaries)
# Create a feature cross of latitude and longitude.
latitude_x_longitude = tf.feature_column.crossed_column([latitude, longitude], hash_bucket_size=100)
crossed_feature = tf.feature_column.indicator_column(latitude_x_longitude)
feature_columns.append(crossed_feature)
# Convert the list of feature columns into a layer that will later be fed into
# the model.
feature_cross_feature_layer = layers.DenseFeatures(feature_columns)
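As before, you could sanity-check the cross by passing one example through the layer; the output is a 100-element indicator vector with a single 1 in the hashed cell for that latitude-longitude pair. The coordinates are again illustrative:
In [0]:
# A minimal sketch: one example passed through the feature-cross layer.
# Because hash_bucket_size=100 and the cross is wrapped in an
# indicator_column, the output is a 100-element vector with a single 1.
one_example = {"latitude": np.array([35.4]), "longitude": np.array([-119.7])}
print(feature_cross_feature_layer(one_example))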
Invoke the following code cell to test your solution for Task 3. Please ignore the warning messages.
In [0]:
# The following variables are the hyperparameters.
learning_rate = 0.04
epochs = 35
# Build the model, this time passing in the feature_cross_feature_layer:
my_model = create_model(learning_rate, feature_cross_feature_layer)
# Train the model on the training set.
epochs, rmse = train_model(my_model, train_df, epochs, batch_size, label_name)
plot_the_loss_curve(epochs, rmse)
print("\n: Evaluate the new model against the test set:")
my_model.evaluate(x=test_features, y=test_label, batch_size=batch_size)
In [0]:
#@title Double-click for an answer to this question.
# Yes, representing these features as a feature
# cross produced much lower loss values than
# representing these features as buckets.
Return to the code cell in the "Represent location as a feature cross" section. Notice that resolution_in_degrees is set to 1.0. Therefore, each cell represents an area of 1.0 degree of latitude by 1.0 degree of longitude, which corresponds to a cell of 110 km by 90 km. This resolution defines a rather large neighborhood.

Experiment with resolution_in_degrees to answer the following questions:

1. What value of resolution_in_degrees produces the best results (lowest loss value)?
2. Why does the loss start to increase once resolution_in_degrees drops below a certain value?

Finally, answer the following question:

3. What feature (that does not exist in the California Housing Dataset) would be a better proxy for location?
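Before retraining at each candidate resolution, it can help to estimate how many grid cells that resolution implies and, roughly, how many training examples land in each cell. A hedged sketch follows; the helper name is ours, not part of the exercise:
In [0]:
# A hypothetical helper (not part of the original exercise): estimate the
# size of the latitude x longitude grid for a given resolution, and the
# average number of training examples per cell (the true distribution is
# uneven, since neighborhoods cluster).
def estimate_grid_size(df, resolution):
  n_lat = len(np.arange(int(min(df["latitude"])),
                        int(max(df["latitude"])), resolution))
  n_lon = len(np.arange(int(min(df["longitude"])),
                        int(max(df["longitude"])), resolution))
  cells = n_lat * n_lon
  print("resolution=%.2f -> ~%d cells, ~%.0f examples per cell on average"
        % (resolution, cells, len(df) / cells))

for r in [1.0, 0.4, 0.1]:
  estimate_grid_size(train_df, r)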
In [0]:
#@title Double-click for possible answers to Task 5.
#1. A resolution of ~0.4 degree provides the best
# results.
#2. Below ~0.4 degree, loss increases because the
# dataset does not contain enough examples in
# each cell to accurately predict prices for
# those cells.
#3. Postal code would be a far better feature
# than latitude X longitude, assuming that
# the dataset contained sufficient examples
# in each postal code.