In [0]:
    
#@title Copyright 2020 Google LLC. Double-click for license information.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
    
After doing this Colab, you'll know how to:
Use tf.feature_column methods to represent features in different ways.
Like several of the previous Colabs, this exercise uses the California Housing Dataset.
In [0]:
    
%tensorflow_version 2.x
    
In [0]:
    
#@title Load the imports
# from __future__ import absolute_import, division, print_function, unicode_literals
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers
from matplotlib import pyplot as plt
# The following lines adjust the granularity of reporting.
pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format
tf.keras.backend.set_floatx('float32')
print("Imported the modules.")
    
The following code cell loads the separate .csv files and creates the following two pandas DataFrames:
train_df, which contains the training set
test_df, which contains the test set
The code cell then scales the median_house_value to a more human-friendly range and then shuffles the examples.
In [0]:
    
# Load the dataset
train_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv")
test_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv")
# Scale the labels
scale_factor = 1000.0
# Scale the training set's label.
train_df["median_house_value"] /= scale_factor 
# Scale the test set's label
test_df["median_house_value"] /= scale_factor
# Shuffle the examples
train_df = train_df.reindex(np.random.permutation(train_df.index))
    
Previous Colabs trained on only a single feature or a single synthetic feature. By contrast, this exercise trains on two features. Furthermore, this Colab introduces feature columns, which provide a sophisticated way to represent features.
You create feature columns as follows:
Call a tf.feature_column method to represent a single feature, single feature cross, or single synthetic feature in the desired way. For example, to represent a certain feature as floating-point values, call tf.feature_column.numeric_column. To represent a certain feature as a series of buckets or bins, call tf.feature_column.bucketized_column.
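For instance, the following minimal sketch (not part of the exercise) applies both calls to the dataset's total_rooms feature; the bucket boundaries are made-up values chosen purely for illustration:
In [0]:
    
# Illustrative sketch only: represent total_rooms as a floating-point value.
rooms_as_float = tf.feature_column.numeric_column("total_rooms")
# Illustrative sketch only: represent total_rooms as four buckets split at
# hypothetical boundaries of 1,000, 2,000, and 4,000 rooms.
rooms_as_buckets = tf.feature_column.bucketized_column(
    rooms_as_float, boundaries=[1000.0, 2000.0, 4000.0])
print(rooms_as_float)
print(rooms_as_buckets)
    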
A neighborhood's location is typically the most important feature in determining a house's value. The California Housing dataset provides two features, latitude and longitude, that identify each neighborhood's location.
The following code cell calls tf.feature_column.numeric_column twice, first to represent latitude as a floating-point value and a second time to represent longitude as a floating-point value.
This code cell specifies the features that you'll ultimately train the model on and how each of those features will be represented. The transformations (collected in fp_feature_layer) don't actually get applied until you pass a DataFrame to it, which will happen when we train the model.
In [0]:
    
# Create an empty list that will eventually hold all feature columns.
feature_columns = []
# Create a numerical feature column to represent latitude.
latitude = tf.feature_column.numeric_column("latitude")
feature_columns.append(latitude)
# Create a numerical feature column to represent longitude.
longitude = tf.feature_column.numeric_column("longitude")
feature_columns.append(longitude)
# Convert the list of feature columns into a layer that will ultimately become
# part of the model. Understanding layers is not important right now.
fp_feature_layer = layers.DenseFeatures(feature_columns)
    
When used, the layer processes the raw inputs, according to the transformations described by the feature columns, and packs the result into a numeric array. (The model will train on this numeric array.)
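If you'd like to see that for yourself, you can call the layer directly on a small batch of raw values. The following sketch is not part of the exercise and the sample coordinates are made up; the printed array has one row per example and one column per feature column:
In [0]:
    
# Illustrative sketch only: call the layer on a hand-made batch of raw inputs.
sample_batch = {"latitude": np.array([34.2, 37.8]),
                "longitude": np.array([-118.3, -122.4])}
# The layer packs the transformed features into a numeric array of shape (2, 2).
print(fp_feature_layer(sample_batch).numpy())
    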
The following code defines three functions:
create_model, which tells TensorFlow to build a linear regression model and to use the feature layer passed to it as the representation of the model's features.
train_model, which will ultimately train the model from training set examples.
plot_the_loss_curve, which generates a loss curve.
In [0]:
    
#@title Define functions to create and train a model, and a plotting function
def create_model(my_learning_rate, feature_layer):
  """Create and compile a simple linear regression model."""
  # Most simple tf.keras models are sequential.
  model = tf.keras.models.Sequential()
  # Add the layer containing the feature columns to the model.
  model.add(feature_layer)
  # Add one linear layer to the model to yield a simple linear regressor.
  model.add(tf.keras.layers.Dense(units=1, input_shape=(1,)))
  # Construct the layers into a model that TensorFlow can execute.
  model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=my_learning_rate),
                loss="mean_squared_error",
                metrics=[tf.keras.metrics.RootMeanSquaredError()])
  return model           
def train_model(model, dataset, epochs, batch_size, label_name):
  """Feed a dataset into the model in order to train it."""
  features = {name:np.array(value) for name, value in dataset.items()}
  label = np.array(features.pop(label_name))
  history = model.fit(x=features, y=label, batch_size=batch_size,
                      epochs=epochs, shuffle=True)
  # The list of epochs is stored separately from the rest of history.
  epochs = history.epoch
  
  # Isolate the root mean squared error for each epoch.
  hist = pd.DataFrame(history.history)
  rmse = hist["root_mean_squared_error"]
  return epochs, rmse   
def plot_the_loss_curve(epochs, rmse):
  """Plot a curve of loss vs. epoch."""
  plt.figure()
  plt.xlabel("Epoch")
  plt.ylabel("Root Mean Squared Error")
  plt.plot(epochs, rmse, label="Loss")
  plt.legend()
  plt.ylim([rmse.min()*0.94, rmse.max()* 1.05])
  plt.show()  
print("Defined the create_model, train_model, and plot_the_loss_curve functions.")
    
In [0]:
    
# The following variables are the hyperparameters.
learning_rate = 0.05
epochs = 30
batch_size = 100
label_name = 'median_house_value'
# Create and compile the model's topography.
my_model = create_model(learning_rate, fp_feature_layer)
# Train the model on the training set.
epochs, rmse = train_model(my_model, train_df, epochs, batch_size, label_name)
plot_the_loss_curve(epochs, rmse)
print("\n: Evaluate the new model against the test set:")
test_features = {name:np.array(value) for name, value in test_df.items()}
test_label = np.array(test_features.pop(label_name))
my_model.evaluate(x=test_features, y=test_label, batch_size=batch_size)
    
In [0]:
    
#@title Double-click to view an answer to Task 1.
# No. Representing latitude and longitude as 
# floating-point values does not have much 
# predictive power. For example, neighborhoods at 
# latitude 35 are not 36/35 more valuable 
# (or 35/36 less valuable) than houses at 
# latitude 36.
# Representing `latitude` and `longitude` as 
# floating-point values provides almost no 
# predictive power. We're only using the raw values 
# to establish a baseline for future experiments 
# with better representations.
    
The following code cell represents latitude and longitude in buckets (bins). Each bin represents all the neighborhoods within a single degree. For example, neighborhoods at latitude 35.4 and 35.8 are in the same bucket, but neighborhoods in latitude 35.4 and 36.2 are in different buckets.
The model will learn a separate weight for each bucket. For example, the model will learn one weight for all the neighborhoods in the "35" bin, a different weight for all the neighborhoods in the "36" bin, and so on. This representation will create approximately 20 buckets:
10 buckets for latitude.
10 buckets for longitude.
In [0]:
    
resolution_in_degrees = 1.0 
# Create a new empty list that will eventually hold the generated feature column.
feature_columns = []
# Create a bucket feature column for latitude.
latitude_as_a_numeric_column = tf.feature_column.numeric_column("latitude")
latitude_boundaries = list(np.arange(int(min(train_df['latitude'])), 
                                     int(max(train_df['latitude'])), 
                                     resolution_in_degrees))
latitude = tf.feature_column.bucketized_column(latitude_as_a_numeric_column, 
                                               latitude_boundaries)
feature_columns.append(latitude)
# Create a bucket feature column for longitude.
longitude_as_a_numeric_column = tf.feature_column.numeric_column("longitude")
longitude_boundaries = list(np.arange(int(min(train_df['longitude'])), 
                                      int(max(train_df['longitude'])), 
                                      resolution_in_degrees))
longitude = tf.feature_column.bucketized_column(longitude_as_a_numeric_column, 
                                                longitude_boundaries)
feature_columns.append(longitude)
# Convert the list of feature columns into a layer that will ultimately become
# part of the model. Understanding layers is not important right now.
buckets_feature_layer = layers.DenseFeatures(feature_columns)
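    
To see concretely what bucketization does to individual values, the following sketch (not part of the exercise) one-hot encodes a few hand-picked latitudes against made-up boundaries at 34, 35, and 36 degrees. As described above, 35.4 and 35.8 land in the same bucket, while 36.2 lands in the next one:
In [0]:
    
# Illustrative sketch only: bucketize a few hand-picked latitude values.
demo_latitude = tf.feature_column.numeric_column("latitude")
demo_buckets = tf.feature_column.bucketized_column(demo_latitude,
                                                   boundaries=[34.0, 35.0, 36.0])
demo_layer = layers.DenseFeatures([demo_buckets])
# Each row of the output is a one-hot vector identifying one value's bucket.
print(demo_layer({"latitude": np.array([35.4, 35.8, 36.2])}).numpy())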
    
In [0]:
    
# The following variables are the hyperparameters.
learning_rate = 0.04
epochs = 35
# Build the model, this time passing in the buckets_feature_layer.
my_model = create_model(learning_rate, buckets_feature_layer)
# Train the model on the training set.
epochs, rmse = train_model(my_model, train_df, epochs, batch_size, label_name)
plot_the_loss_curve(epochs, rmse)
print("\n: Evaluate the new model against the test set:")
my_model.evaluate(x=test_features, y=test_label, batch_size=batch_size)
    
In [0]:
    
#@title Double-click for an answer to Task 2.
# Bucket representation outperformed 
# floating-point representations.  
# However, you can still do far better.
    
In [0]:
    
#@title Double-click to view an answer to Task 3.
# Representing location as a feature cross should 
# produce better results.
# In Task 2, you represented latitude in 
# one-dimensional buckets and longitude in 
# another series of one-dimensional buckets. 
# Real-world locations, however, exist in 
# two dimensions. Therefore, you should
# represent location as a two-dimensional feature
# cross. That is, you'll cross the 10 or so latitude 
# buckets with the 10 or so longitude buckets to 
# create a grid of 100 cells. 
# The model will learn separate weights for each 
# of the cells.
    
In [0]:
    
resolution_in_degrees = 1.0 
# Create a new empty list that will eventually hold the generated feature column.
feature_columns = []
# Create a bucket feature column for latitude.
latitude_as_a_numeric_column = tf.feature_column.numeric_column("latitude")
latitude_boundaries = list(np.arange(int(min(train_df['latitude'])), int(max(train_df['latitude'])), resolution_in_degrees))
latitude = tf.feature_column.bucketized_column(latitude_as_a_numeric_column, latitude_boundaries)
# Create a bucket feature column for longitude.
longitude_as_a_numeric_column = tf.feature_column.numeric_column("longitude")
longitude_boundaries = list(np.arange(int(min(train_df['longitude'])), int(max(train_df['longitude'])), resolution_in_degrees))
longitude = tf.feature_column.bucketized_column(longitude_as_a_numeric_column, longitude_boundaries)
# Create a feature cross of latitude and longitude, hashing each crossed
# (latitude bucket, longitude bucket) pair into one of 100 hash buckets.
latitude_x_longitude = tf.feature_column.crossed_column([latitude, longitude], hash_bucket_size=100)
# Wrap the crossed column in an indicator column so that it's represented as a
# one-hot vector the DenseFeatures layer can handle.
crossed_feature = tf.feature_column.indicator_column(latitude_x_longitude)
feature_columns.append(crossed_feature)
# Convert the list of feature columns into a layer that will later be fed into
# the model. 
feature_cross_feature_layer = layers.DenseFeatures(feature_columns)
    
Invoke the following code cell to test your solution for Task 3. Please ignore the warning messages.
In [0]:
    
# The following variables are the hyperparameters.
learning_rate = 0.04
epochs = 35
# Build the model, this time passing in the feature_cross_feature_layer: 
my_model = create_model(learning_rate, feature_cross_feature_layer)
# Train the model on the training set.
epochs, rmse = train_model(my_model, train_df, epochs, batch_size, label_name)
plot_the_loss_curve(epochs, rmse)
print("\n: Evaluate the new model against the test set:")
my_model.evaluate(x=test_features, y=test_label, batch_size=batch_size)
    
In [0]:
    
#@title Double-click for an answer to this question.
# Yes, representing these features as a feature 
# cross produced much lower loss values than 
# representing these features as buckets.
    
Return to the code cell in the "Represent location as a feature cross" section. Notice that resolution_in_degrees is set to 1.0. Therefore, each cell represents an area of 1.0 degree of latitude by 1.0 degree of longitude, which corresponds to a cell of 110 km by 90 km.  This resolution defines a rather large neighborhood.
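As a quick sanity check on those numbers (not part of the exercise): one degree of latitude spans roughly 111 km everywhere, while one degree of longitude shrinks with the cosine of the latitude, so near 36 degrees north it spans roughly 111 * cos(36°) ≈ 90 km:
In [0]:
    
# Illustrative sketch only: approximate the size of a 1.0-degree cell at a
# latitude of roughly 36 degrees north (typical for California).
km_per_degree_latitude = 111.0
km_per_degree_longitude = 111.0 * np.cos(np.radians(36.0))
print(f"A 1.0-degree cell spans roughly {km_per_degree_latitude:.0f} km by "
      f"{km_per_degree_longitude:.0f} km.")
    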
Experiment with resolution_in_degrees to answer the following questions:
1. What value of resolution_in_degrees produces the best results (lowest loss value)?
2. Why does the loss increase once resolution_in_degrees drops below a certain value?
Finally, answer the following question:
3. What feature, other than latitude X longitude, would better identify a neighborhood's location?
In [0]:
    
#@title Double-click for possible answers to Task 5.
#1. A resolution of ~0.4 degree provides the best 
#   results.
#2. Below ~0.4 degree, loss increases because the 
#   dataset does not contain enough examples in 
#   each cell to accurately predict prices for 
#   those cells.
#3. Postal code would be a far better feature 
#   than latitude X longitude, assuming that 
#   the dataset contained sufficient examples 
#   in each postal code.