Learning Objectives:
Use the LinearRegressor class in TensorFlow to predict median housing price, at the granularity of city blocks, based on one input feature
In [4]:
# Load the necessary libraries
import math
from IPython import display
from matplotlib import cm, gridspec, pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset
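# Only log errors (tf.logging is a TensorFlow 1.x API).
tf.logging.set_verbosity(tf.logging.ERROR)
# Keep printed DataFrames short and round floats to one decimal place.
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format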
In [122]:
# Load the dataset
california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")
We'll randomize the data to avoid any pathological ordering effects that might harm the performance of Stochastic Gradient Descent. Additionally, we'll scale median_house_value to be in units of thousands, so it can be learned a little more easily with learning rates in the range we usually use.
In [123]:
california_housing_dataframe = california_housing_dataframe.reindex(np.random.permutation(california_housing_dataframe.index))
california_housing_dataframe["median_house_value"] /= 1000
california_housing_dataframe
Out[123]:
In [124]:
california_housing_dataframe.describe()
Out[124]:
In this exercise, we'll try to predict median_house_value, which will be our label (sometimes also called a target). We'll use total_rooms as our input feature.
NOTE: Our data is at the city block level, so this feature represents the total number of rooms in that block.
To train our model, we'll use the LinearRegressor interface provided by the TensorFlow Estimator API. This API takes care of a lot of the low-level model plumbing, and exposes convenient methods for performing model training, evaluation, and inference.
In order to import our training data into TensorFlow, we need to specify what type of data each feature contains. There are two main types of data we'll use in this and future exercises:
Categorical Data: Data that is textual. In this exercise, our housing data set does not contain any categorical features, but examples you might see include the home style or the words in a real-estate ad.
Numerical Data: Data that is a number (integer or float) and that you want to treat as a number. As we will discuss later, sometimes you might want to treat numerical data (e.g., a postal code) as if it were categorical.
In TensorFlow, we indicate a feature's data type using a construct called a feature column. Feature columns store only a description of the feature data; they do not contain the feature data itself.
To start, we're going to use just one numeric input feature, total_rooms. The following code pulls the total_rooms data from our california_housing_dataframe and defines the feature column using numeric_column, which specifies that its data is numeric:
In [132]:
# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]
# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]
In [133]:
# Define the label
targets = california_housing_dataframe["median_house_value"]
Next, we'll configure a linear regression model using LinearRegressor. We'll train this model using the GradientDescentOptimizer, which implements Mini-Batch Stochastic Gradient Descent (SGD). The learning_rate argument controls the size of the gradient step.
NOTE: To be safe, we also apply gradient clipping to our optimizer via clip_gradients_by_norm. Gradient clipping ensures that the magnitude of the gradients does not become too large during training, which can cause gradient descent to fail.
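To make the idea of clipping by norm concrete, here is a minimal plain-Python sketch. It is an illustration under stated assumptions, not TensorFlow's implementation: the helper name clip_by_norm is hypothetical, and the gradient is treated as a single flat list of numbers.

```python
import math


def clip_by_norm(gradient, clip_norm):
    """Rescale `gradient` so that its L2 norm is at most `clip_norm`.

    Hypothetical helper for illustration only; the direction of the
    gradient is preserved, only its length is capped.
    """
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clip_norm:
        # Already small enough: leave the gradient untouched.
        return list(gradient)
    scale = clip_norm / norm
    return [g * scale for g in gradient]


# A gradient whose norm (5.0) is within the limit is returned unchanged.
small = clip_by_norm([3.0, 4.0], clip_norm=5.0)

# A gradient whose norm (50.0) exceeds the limit is scaled down to norm 5.0.
large = clip_by_norm([30.0, 40.0], clip_norm=5.0)

print(small, large)
```

Because only the length changes, a clipped update still moves the weights in the same direction, just by a bounded amount.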
In [134]:
# Use gradient descent as the optimizer for training the model.
# Set a learning rate of 0.0000001 for Gradient Descent.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
# Clip gradients to a maximum norm of 5.0 to keep updates from becoming too large.
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
# Configure the linear regression model with our feature columns and optimizer.
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)
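To show roughly what mini-batch SGD with a learning rate does for a one-feature linear model, here is a small plain-Python sketch. It is not what LinearRegressor runs internally: the function name sgd_linear_regression, the toy data, and the learning rate of 0.05 are all assumptions chosen for illustration (the toy feature values are small, so a much larger learning rate than the one above is appropriate).

```python
import random


def sgd_linear_regression(xs, ys, learning_rate, steps, batch_size, seed=0):
    """Fit y ~ weight * x + bias with mini-batch SGD on mean squared error.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    weight, bias = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(steps):
        # Draw a random mini-batch of examples.
        batch = rng.sample(indices, batch_size)
        grad_w, grad_b = 0.0, 0.0
        for i in batch:
            error = (weight * xs[i] + bias) - ys[i]
            # Gradient of the mean squared error over the batch.
            grad_w += 2.0 * error * xs[i] / batch_size
            grad_b += 2.0 * error / batch_size
        # The learning rate scales how far each step moves the parameters.
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias


# Made-up toy data that follows y = 2x + 1 exactly.
toy_x = [0.0, 1.0, 2.0, 3.0, 4.0]
toy_y = [2.0 * x + 1.0 for x in toy_x]

fitted_weight, fitted_bias = sgd_linear_regression(
    toy_x, toy_y, learning_rate=0.05, steps=5000, batch_size=2
)
print(fitted_weight, fitted_bias)
```

On this noiseless toy data the fitted parameters should approach a weight of 2 and a bias of 1. With unscaled features such as total_rooms, the gradients are far larger, which is why a much smaller learning rate is used in the cell above.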
To import our California housing data into our LinearRegressor, we need to define an input function, which instructs TensorFlow how to preprocess the data, as well as how to batch, shuffle, and repeat it during model training.
First, we'll convert our pandas feature data into a dict of NumPy arrays. We can then use the TensorFlow Dataset API to construct a dataset object from our data, and then break our data into batches of batch_size, to be repeated for the specified number of epochs (num_epochs).
NOTE: When the default value of num_epochs=None is passed to repeat(), the input data will be repeated indefinitely.
Next, if shuffle is set to True, we'll shuffle the data so that it's passed to the model randomly during training. The buffer_size argument specifies the size of the dataset from which shuffle will randomly sample.
Finally, our input function constructs an iterator for the dataset and returns the next batch of data to the LinearRegressor.
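The batch, repeat, and shuffle steps described above can be sketched in plain Python. This is only an illustration of the semantics, not the TensorFlow Dataset API and not the exercise's input function: the names toy_input_fn, batch_records, repeat_epochs, and shuffle_buffered are hypothetical, the data is made up, and the pandas-to-NumPy conversion is skipped. Following the order in the text, batches are formed first, so the shuffle step reorders whole batches.

```python
import itertools
import random


def batch_records(records, batch_size):
    """Yield consecutive lists of up to `batch_size` records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def repeat_epochs(make_iterable, num_epochs=None):
    """Replay the data `num_epochs` times, or indefinitely when it is None."""
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        yield from make_iterable()


def shuffle_buffered(items, buffer_size, rng):
    """Buffered shuffle: keep up to `buffer_size` items (assumed >= 1)
    and emit a randomly chosen one each time the buffer is full."""
    buffer = []
    for item in items:
        if len(buffer) == buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
        buffer.append(item)
    # Drain whatever is left once the input is exhausted.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


def toy_input_fn(features, targets, batch_size=1, shuffle_data=True,
                 num_epochs=None, buffer_size=10000, seed=0):
    """Return an iterator over batches of (feature, target) pairs."""
    records = list(zip(features, targets))
    stream = repeat_epochs(lambda: batch_records(records, batch_size), num_epochs)
    if shuffle_data:
        stream = shuffle_buffered(stream, buffer_size, random.Random(seed))
    return stream


# Made-up feature values and labels, for illustration only.
rooms = [10.0, 20.0, 30.0, 40.0]
prices = [1.0, 2.0, 3.0, 4.0]

# With num_epochs=None the stream never ends, so take just three batches.
stream = toy_input_fn(rooms, prices, batch_size=2, shuffle_data=False)
first_three = list(itertools.islice(stream, 3))
print(first_three)
```

With shuffling disabled, the third batch is the first batch again, which shows the data being repeated. A small buffer_size only mixes nearby items, while a buffer at least as large as the dataset allows a full shuffle.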
In [ ]: