LAB 03: Basic Feature Engineering in Keras

Learning Objectives

  1. Create an input pipeline using tf.data
  2. Engineer features to create categorical, crossed, and numerical feature columns

Introduction

In this lab, we utilize feature engineering to improve the prediction of housing prices using a Keras Sequential Model.

Each learning objective will correspond to a #TODO in the student lab notebook -- try to complete that notebook first before reviewing this solution notebook.

Start by importing the necessary libraries for this lab.


In [1]:
# Install Sklearn
!python3 -m pip install --user sklearn

# Ensure the right version of Tensorflow is installed.
!pip3 freeze | grep 'tensorflow==2\|tensorflow-gpu==2' || \
!python3 -m pip install --user tensorflow==2


Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.5/site-packages (from sklearn) (0.19.2)
Requirement already satisfied: intel-scipy in /usr/local/lib/python3.5/dist-packages (from scikit-learn->sklearn) (1.1.0)
Requirement already satisfied: pydaal in /usr/local/lib/python3.5/dist-packages (from scikit-learn->sklearn) (2019.0.0.20180713)
Requirement already satisfied: intel-numpy in /usr/local/lib/python3.5/dist-packages (from intel-scipy->scikit-learn->sklearn) (1.15.1)
Requirement already satisfied: daal==2019.* in /usr/local/lib/python3.5/dist-packages (from pydaal->scikit-learn->sklearn) (2019.0)
Requirement already satisfied: tbb4py==2019.* in /usr/local/lib/python3.5/dist-packages (from pydaal->scikit-learn->sklearn) (2019.0)
Requirement already satisfied: mkl-random in /usr/local/lib/python3.5/dist-packages (from intel-numpy->intel-scipy->scikit-learn->sklearn) (1.0.1.1)
Requirement already satisfied: icc-rt in /usr/local/lib/python3.5/dist-packages (from intel-numpy->intel-scipy->scikit-learn->sklearn) (2020.0.133)
Requirement already satisfied: mkl-fft in /usr/local/lib/python3.5/dist-packages (from intel-numpy->intel-scipy->scikit-learn->sklearn) (1.0.6)
Requirement already satisfied: mkl in /usr/local/lib/python3.5/dist-packages (from intel-numpy->intel-scipy->scikit-learn->sklearn) (2019.0)
Requirement already satisfied: tbb==2019.* in /usr/local/lib/python3.5/dist-packages (from daal==2019.*->pydaal->scikit-learn->sklearn) (2019.0)
Requirement already satisfied: intel-openmp==2020.* in /usr/local/lib/python3.5/dist-packages (from icc-rt->intel-numpy->intel-scipy->scikit-learn->sklearn) (2020.0.133)
Building wheels for collected packages: sklearn
  Building wheel for sklearn (setup.py) ... done
  Created wheel for sklearn: filename=sklearn-0.0-py2.py3-none-any.whl size=2397 sha256=9c96a4ca8cb038ff9b7c32929c1816184d0daaeead3b4e184ca1adb81730b4a2
  Stored in directory: /home/jupyter/.cache/pip/wheels/9e/ec/a6/33cdb5605b0b150074213e154792654a1006e6e6807dc7ca6f
Successfully built sklearn
Installing collected packages: sklearn
Successfully installed sklearn-0.0
tensorflow-gpu==2.1.0

In [40]:
import os
import tensorflow.keras

import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf

from tensorflow import feature_column as fc
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from keras.utils import plot_model

print("TensorFlow version: ",tf.version.VERSION)


TensorFlow version:  2.1.0

Many of the Google Machine Learning Courses Programming Exercises use the California Housing Dataset, which contains data drawn from the 1990 U.S. Census. Our lab dataset has been pre-processed so that there are no missing values.

First, let's download the raw .csv data by copying the data from a cloud storage bucket.


In [5]:
if not os.path.isdir("../data"):
    os.makedirs("../data")

In [6]:
!gsutil cp gs://cloud-training-demos/feat_eng/housing/housing_pre-proc.csv ../data


Copying gs://cloud-training-demos/feat_eng/housing/housing_pre-proc.csv...
/ [1 files][  1.4 MiB/  1.4 MiB]                                                
Operation completed over 1 objects/1.4 MiB.                                      

In [9]:
!ls -l ../data/


total 6832
-rw-r--r-- 1 jupyter jupyter 1435069 Feb 21 22:46 housing_pre-proc.csv
-rw-r--r-- 1 jupyter jupyter 1113292 Feb 21 16:14 taxi-test.csv
-rw-r--r-- 1 jupyter jupyter 3551735 Feb 21 16:14 taxi-train.csv
-rw-r--r-- 1 jupyter jupyter  888648 Feb 21 16:14 taxi-valid.csv

Now, let's read in the dataset just copied from the cloud storage bucket and create a Pandas dataframe.


In [11]:
housing_df = pd.read_csv('../data/housing_pre-proc.csv', error_bad_lines=False)
housing_df.head()


Out[11]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY

We can use .describe() to see some summary statistics for the numeric fields in our dataframe. Note, for example, the count row and corresponding columns. The count shows 20433.000000 for all feature columns. Thus, there are no missing values.


In [12]:
housing_df.describe()


Out[12]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20433.000000 20433.000000 20433.000000 20433.000000 20433.000000 20433.000000 20433.000000 20433.000000 20433.000000
mean -119.570689 35.633221 28.633094 2636.504233 537.870553 1424.946949 499.433465 3.871162 206864.413155
std 2.003578 2.136348 12.591805 2185.269567 421.385070 1133.208490 382.299226 1.899291 115435.667099
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1450.000000 296.000000 787.000000 280.000000 2.563700 119500.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.536500 179700.000000
75% -118.010000 37.720000 37.000000 3143.000000 647.000000 1722.000000 604.000000 4.744000 264700.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

Split the dataset for ML

The dataset we loaded was a single CSV file. We will split this into train, validation, and test sets.


In [13]:
train, test = train_test_split(housing_df, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)

print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')


13076 train examples
3270 validation examples
4087 test examples

Now, we need to output the split files. We will specifically need the test.csv later for testing. You should see the files appear in the home directory.


In [14]:
train.to_csv('../data/housing-train.csv', encoding='utf-8', index=False)

In [15]:
val.to_csv('../data/housing-val.csv', encoding='utf-8', index=False)

In [17]:
test.to_csv('../data/housing-test.csv', encoding='utf-8', index=False)

In [18]:
!head ../data/housing*.csv


==> ../data/housing-test.csv <==
longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-117.73,34.12,26.0,6459.0,894.0,2487.0,885.0,6.2089,261800.0,INLAND
-122.3,37.55,35.0,3675.0,735.0,1930.0,715.0,3.9833,342800.0,NEAR OCEAN
-118.37,33.88,27.0,1688.0,331.0,811.0,327.0,4.5357,334200.0,<1H OCEAN
-117.91,33.78,33.0,2729.0,549.0,2223.0,535.0,4.0362,177900.0,<1H OCEAN
-117.92,33.88,32.0,1632.0,244.0,575.0,235.0,5.3986,318700.0,<1H OCEAN
-117.06,32.64,30.0,4494.0,667.0,1883.0,680.0,5.766,186100.0,NEAR OCEAN
-119.69,36.25,35.0,2011.0,349.0,970.0,300.0,2.395,94100.0,INLAND
-122.57,37.98,49.0,2860.0,552.0,1178.0,522.0,4.625,355000.0,NEAR BAY
-121.69,39.36,29.0,2220.0,471.0,1170.0,428.0,2.3224,56200.0,INLAND

==> ../data/housing-train.csv <==
longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-116.92,32.81,17.0,1312.0,394.0,836.0,337.0,1.6686,112500.0,<1H OCEAN
-122.26,38.04,41.0,2512.0,539.0,1179.0,480.0,2.694,123000.0,NEAR BAY
-119.23,35.77,26.0,2636.0,468.0,1416.0,485.0,4.1917,84000.0,INLAND
-122.1,37.66,36.0,1305.0,225.0,768.0,234.0,4.275,185300.0,NEAR BAY
-118.28,33.92,39.0,1472.0,302.0,1036.0,318.0,3.0,110000.0,<1H OCEAN
-120.01,39.26,26.0,1930.0,391.0,307.0,138.0,2.6023,139300.0,INLAND
-116.5,33.82,16.0,343.0,85.0,29.0,14.0,2.1042,87500.0,INLAND
-122.11,37.98,11.0,4371.0,679.0,1790.0,660.0,6.135,297300.0,NEAR BAY
-117.9,33.61,44.0,1469.0,312.0,507.0,266.0,3.4937,500001.0,<1H OCEAN

==> ../data/housing-val.csv <==
longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-118.51,35.16,7.0,4371.0,727.0,1932.0,654.0,4.625,136800.0,INLAND
-118.24,33.9,38.0,2055.0,442.0,1518.0,425.0,2.3382,103000.0,<1H OCEAN
-122.82,38.41,32.0,701.0,182.0,489.0,168.0,2.785,169300.0,<1H OCEAN
-118.24,33.93,32.0,779.0,201.0,861.0,219.0,1.0625,89800.0,<1H OCEAN
-117.0,32.85,24.0,1888.0,319.0,950.0,319.0,5.282,140800.0,<1H OCEAN
-121.29,37.99,45.0,965.0,198.0,498.0,195.0,1.6944,75200.0,INLAND
-117.66,33.6,25.0,3745.0,522.0,1648.0,496.0,7.5488,278100.0,<1H OCEAN
-122.28,37.85,52.0,2246.0,472.0,1005.0,449.0,2.4167,152700.0,NEAR BAY
-122.49,38.31,27.0,3078.0,597.0,1411.0,586.0,3.25,195500.0,<1H OCEAN

==> ../data/housing_pre-proc.csv <==
longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
-122.25,37.85,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
-122.25,37.84,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
-122.25,37.84,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
-122.26,37.84,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY

Create an input pipeline using tf.data

Next, we will wrap the dataframes with tf.data. This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model.

Here, we create an input pipeline using tf.data. This function is missing two lines. Correct and run the cell.


In [19]:
# A utility method to create a tf.data dataset from a Pandas Dataframe
# TODO 1
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('median_house_value')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

Next we initialize the training and validation datasets.


In [20]:
batch_size = 32
train_ds = df_to_dataset(train)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)

Now that we have created the input pipeline, let's call it to see the format of the data it returns. We have used a small batch size to keep the output readable.


In [21]:
# TODO 1
for feature_batch, label_batch in train_ds.take(1):
    print('Every feature:', list(feature_batch.keys()))
    print('A batch of households:', feature_batch['households'])
    print('A batch of ocean_proximity:', feature_batch['ocean_proximity'])
    print('A batch of targets:', label_batch)


Every feature: ['total_bedrooms', 'total_rooms', 'ocean_proximity', 'median_income', 'longitude', 'population', 'latitude', 'housing_median_age', 'households']
A batch of households: tf.Tensor(
[ 341.  567. 1067.   89.  525.  217.  898.  589.  406.  211.  465.  247.
  359.  625.  198.  893.  206.  721.  200.  355.  624.  135.  328.  181.
   66.  504.  979.  440.  224.  535.  615.  218.], shape=(32,), dtype=float64)
A batch of ocean_proximity: tf.Tensor(
[b'<1H OCEAN' b'NEAR OCEAN' b'<1H OCEAN' b'NEAR OCEAN' b'<1H OCEAN'
 b'NEAR BAY' b'<1H OCEAN' b'<1H OCEAN' b'NEAR OCEAN' b'NEAR BAY'
 b'NEAR BAY' b'NEAR BAY' b'INLAND' b'INLAND' b'INLAND' b'<1H OCEAN'
 b'NEAR BAY' b'<1H OCEAN' b'<1H OCEAN' b'<1H OCEAN' b'NEAR BAY'
 b'NEAR BAY' b'INLAND' b'NEAR BAY' b'INLAND' b'INLAND' b'<1H OCEAN'
 b'<1H OCEAN' b'NEAR OCEAN' b'<1H OCEAN' b'<1H OCEAN' b'INLAND'], shape=(32,), dtype=string)
A batch of targets: tf.Tensor(
[164500. 231300. 203400. 273900. 185200. 333300. 348800. 218700. 329500.
 129500. 170200. 500001.  74600.  54400.  96900. 477300. 149700. 261100.
 183000. 145200. 210800. 157500. 140100. 344000.  52500. 121200. 500001.
 162800. 112500. 143000. 127200.  63900.], shape=(32,), dtype=float64)

We can see that the dataset returns a dictionary of column names (from the dataframe) that map to column values from rows in the dataframe.

Numeric columns

The output of a feature column becomes the input to the model. A numeric is the simplest type of column. It is used to represent real valued features. When using this column, your model will receive the column value from the dataframe unchanged.

In the California housing prices dataset, most columns from the dataframe are numeric. Let' create a variable called numeric_cols to hold only the numerical feature columns.


In [22]:
# TODO 1
numeric_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                'total_bedrooms', 'population', 'households', 'median_income']

Scaler function

It is very important for numerical variables to get scaled before they are "fed" into the neural network. Here we use min-max scaling. Here we are creating a function named 'get_scal' which takes a list of numerical features and returns a 'minmax' function, which will be used in tf.feature_column.numeric_column() as normalizer_fn in parameters. 'Minmax' function itself takes a 'numerical' number from a particular feature and return scaled value of that number.

Next, we scale the numerical feature columns that we assigned to the variable "numeric cols".


In [23]:
# Scalar def get_scal(feature):
# TODO 1
def get_scal(feature):
    def minmax(x):
        mini = train[feature].min()
        maxi = train[feature].max()
        return (x - mini)/(maxi-mini)
        return(minmax)

In [24]:
# TODO 1
feature_columns = []
for header in numeric_cols:
    scal_input_fn = get_scal(header)
    feature_columns.append(fc.numeric_column(header,
                                             normalizer_fn=scal_input_fn))

Next, we should validate the total number of feature columns. Compare this number to the number of numeric features you input earlier.


In [25]:
print('Total number of feature coLumns: ', len(feature_columns))


Total number of feature coLumns:  8

Using the Keras Sequential Model

Next, we will run this cell to compile and fit the Keras Sequential model.


In [26]:
# Model create
feature_layer = tf.keras.layers.DenseFeatures(feature_columns, dtype='float64')

model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(12, input_dim=8, activation='relu'),
  layers.Dense(8, activation='relu'),
  layers.Dense(1, activation='linear',  name='median_house_value')
])

# Model compile
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mse'])

# Model Fit
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=32)


Train for 409 steps, validate for 103 steps
Epoch 1/32
409/409 [==============================] - 3s 7ms/step - loss: 46448450617.1024 - mse: 46455398400.0000 - val_loss: 30496066977.5534 - val_mse: 30345949184.0000
Epoch 2/32
409/409 [==============================] - 2s 5ms/step - loss: 27358006279.5660 - mse: 27364796416.0000 - val_loss: 26113448790.9903 - val_mse: 25989636096.0000
Epoch 3/32
409/409 [==============================] - 2s 5ms/step - loss: 26557686567.8750 - mse: 26558568448.0000 - val_loss: 25628598898.3301 - val_mse: 25509109760.0000
Epoch 4/32
409/409 [==============================] - 2s 5ms/step - loss: 26004593578.8736 - mse: 26006214656.0000 - val_loss: 25086774649.7864 - val_mse: 24966019072.0000
Epoch 5/32
409/409 [==============================] - 2s 5ms/step - loss: 25353917142.1981 - mse: 25359810560.0000 - val_loss: 24392426346.8738 - val_mse: 24276228096.0000
Epoch 6/32
409/409 [==============================] - 2s 5ms/step - loss: 24539155025.7988 - mse: 24551505920.0000 - val_loss: 23548652414.7573 - val_mse: 23432804352.0000
Epoch 7/32
409/409 [==============================] - 2s 5ms/step - loss: 23494111225.5349 - mse: 23495385088.0000 - val_loss: 22390547794.0194 - val_mse: 22282233856.0000
Epoch 8/32
409/409 [==============================] - 2s 5ms/step - loss: 22182795612.6290 - mse: 22190280704.0000 - val_loss: 20994081001.6311 - val_mse: 20892551168.0000
Epoch 9/32
409/409 [==============================] - 2s 5ms/step - loss: 20585903645.4015 - mse: 20569870336.0000 - val_loss: 19335449058.1748 - val_mse: 19242725376.0000
Epoch 10/32
409/409 [==============================] - 2s 5ms/step - loss: 18715895996.9971 - mse: 18705152000.0000 - val_loss: 17429645416.3883 - val_mse: 17344231424.0000
Epoch 11/32
409/409 [==============================] - 2s 5ms/step - loss: 16664346753.9289 - mse: 16672082944.0000 - val_loss: 15520907328.6214 - val_mse: 15448484864.0000
Epoch 12/32
409/409 [==============================] - 2s 5ms/step - loss: 14691278532.1835 - mse: 14699193344.0000 - val_loss: 13750330917.2816 - val_mse: 13691328512.0000
Epoch 13/32
409/409 [==============================] - 2s 5ms/step - loss: 13056609219.8011 - mse: 13065994240.0000 - val_loss: 12550072832.0000 - val_mse: 12505309184.0000
Epoch 14/32
409/409 [==============================] - 2s 5ms/step - loss: 12084994099.4403 - mse: 12080374784.0000 - val_loss: 11936656080.7767 - val_mse: 11904313344.0000
Epoch 15/32
409/409 [==============================] - 2s 5ms/step - loss: 11600224334.6277 - mse: 11593289728.0000 - val_loss: 11702882219.4951 - val_mse: 11679186944.0000
Epoch 16/32
409/409 [==============================] - 2s 5ms/step - loss: 11414487135.0066 - mse: 11411551232.0000 - val_loss: 11603847123.2621 - val_mse: 11585359872.0000
Epoch 17/32
409/409 [==============================] - 2s 5ms/step - loss: 11331262045.0630 - mse: 11332776960.0000 - val_loss: 11560417155.7282 - val_mse: 11544253440.0000
Epoch 18/32
409/409 [==============================] - 2s 5ms/step - loss: 11291198817.7139 - mse: 11289479168.0000 - val_loss: 11570508059.3398 - val_mse: 11556798464.0000
Epoch 19/32
409/409 [==============================] - 2s 5ms/step - loss: 11245258445.4658 - mse: 11243645952.0000 - val_loss: 11478747121.0874 - val_mse: 11464867840.0000
Epoch 20/32
409/409 [==============================] - 2s 5ms/step - loss: 11210740782.6404 - mse: 11208387584.0000 - val_loss: 11444833488.7767 - val_mse: 11430981632.0000
Epoch 21/32
409/409 [==============================] - 2s 5ms/step - loss: 11167696639.0480 - mse: 11170749440.0000 - val_loss: 11437642916.0388 - val_mse: 11425636352.0000
Epoch 22/32
409/409 [==============================] - 2s 5ms/step - loss: 11125137414.7783 - mse: 11129923584.0000 - val_loss: 11359984421.2816 - val_mse: 11347213312.0000
Epoch 23/32
409/409 [==============================] - 2s 5ms/step - loss: 11071227646.3516 - mse: 11077665792.0000 - val_loss: 11326483530.5631 - val_mse: 11314733056.0000
Epoch 24/32
409/409 [==============================] - 2s 5ms/step - loss: 11042410394.9676 - mse: 11046813696.0000 - val_loss: 11377244423.4563 - val_mse: 11364465664.0000
Epoch 25/32
409/409 [==============================] - 2s 6ms/step - loss: 11019032031.5657 - mse: 11023687680.0000 - val_loss: 11260152777.3204 - val_mse: 11248880640.0000
Epoch 26/32
409/409 [==============================] - 2s 6ms/step - loss: 10965436818.2626 - mse: 10971379712.0000 - val_loss: 11229627600.7767 - val_mse: 11217687552.0000
Epoch 27/32
409/409 [==============================] - 2s 5ms/step - loss: 10949453183.0780 - mse: 10945600512.0000 - val_loss: 11193287158.0583 - val_mse: 11183252480.0000
Epoch 28/32
409/409 [==============================] - 2s 5ms/step - loss: 10911945940.0635 - mse: 10916561920.0000 - val_loss: 11139523519.3786 - val_mse: 11128924160.0000
Epoch 29/32
409/409 [==============================] - 2s 5ms/step - loss: 10887403733.1946 - mse: 10891365376.0000 - val_loss: 11132160094.4466 - val_mse: 11123118080.0000
Epoch 30/32
409/409 [==============================] - 2s 5ms/step - loss: 10837585640.2128 - mse: 10839391232.0000 - val_loss: 11066389931.4951 - val_mse: 11056680960.0000
Epoch 31/32
409/409 [==============================] - 2s 5ms/step - loss: 10806513138.6968 - mse: 10810738688.0000 - val_loss: 11016093964.4272 - val_mse: 11006873600.0000
Epoch 32/32
409/409 [==============================] - 2s 5ms/step - loss: 10765003210.7868 - mse: 10758446080.0000 - val_loss: 10965273684.5049 - val_mse: 10956495872.0000

Next we show loss as Mean Square Error (MSE). Remember that MSE is the most commonly used regression loss function. MSE is the sum of squared distances between our target variable (e.g. housing median age) and predicted values.


In [27]:
loss, mse = model.evaluate(train_ds)
print("Mean Squared Error", mse)


409/409 [==============================] - 1s 4ms/step - loss: 10718007119.4914 - mse: 10718919680.0000
Mean Squared Error 10718920000.0

Visualize the model loss curve

Next, we will use matplotlib to draw the model's loss curves for training and validation. A line plot is also created showing the mean squared error loss over the training epochs for both the train (blue) and test (orange) sets.


In [29]:
def plot_curves(history, metrics):
    nrows = 1
    ncols = 2
    fig = plt.figure(figsize=(10, 5))

    for idx, key in enumerate(metrics):  
        ax = fig.add_subplot(nrows, ncols, idx+1)
        plt.plot(history.history[key])
        plt.plot(history.history['val_{}'.format(key)])
        plt.title('model {}'.format(key))
        plt.ylabel(key)
        plt.xlabel('epoch')
        plt.legend(['train', 'validation'], loc='upper left');

In [30]:
plot_curves(history, ['loss', 'mse'])


Load test data

Next, we read in the test.csv file and validate that there are no null values.

Again, we can use .describe() to see some summary statistics for the numeric fields in our dataframe. The count shows 4087.000000 for all feature columns. Thus, there are no missing values.


In [31]:
test_data = pd.read_csv('../data/housing-test.csv')
test_data.describe()


Out[31]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 4087.000000 4087.000000 4087.000000 4087.000000 4087.000000 4087.000000 4087.000000 4087.000000 4087.000000
mean -119.540771 35.594710 28.599462 2649.056276 542.098361 1439.021532 504.180328 3.837996 207878.661121
std 2.011479 2.119069 12.534898 2211.250288 428.624000 1169.846216 390.876211 1.878277 116402.263079
min -124.300000 32.570000 1.000000 6.000000 2.000000 5.000000 2.000000 0.499900 22500.000000
25% -121.805000 33.930000 18.000000 1445.500000 298.000000 792.000000 282.000000 2.561250 118900.000000
50% -118.470000 34.240000 29.000000 2127.000000 437.000000 1171.000000 410.000000 3.517200 180400.000000
75% -117.980000 37.690000 37.000000 3170.500000 655.000000 1732.500000 610.000000 4.700800 267450.000000
max -114.310000 41.840000 52.000000 32627.000000 6445.000000 28566.000000 6082.000000 15.000100 500001.000000

Now that we have created an input pipeline using tf.data and compiled a Keras Sequential Model, we now create the input function for the test data and to initialize the test_predict variable.


In [32]:
# TODO 1
def test_input_fn(features, batch_size=256):
    """An input function for prediction."""
    # Convert the inputs to a Dataset without labels.
    return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)

In [33]:
test_predict = test_input_fn(dict(test_data))

Prediction: Linear Regression

Before we begin to feature engineer our feature columns, we should predict the median house value. By predicting the median house value now, we can then compare it with the median house value after feature engineeing.

To predict with Keras, you simply call model.predict() and pass in the housing features you want to predict the median_house_value for. Note: We are predicting the model locally.


In [34]:
predicted_median_house_value = model.predict(test_predict)

Next, we run two predictions in separate cells - one where ocean_proximity=INLAND and one where ocean_proximity= NEAR OCEAN.


In [35]:
# Ocean_proximity is INLAND
model.predict({
    'longitude': tf.convert_to_tensor([-121.86]),
    'latitude': tf.convert_to_tensor([39.78]),
    'housing_median_age': tf.convert_to_tensor([12.0]),
    'total_rooms': tf.convert_to_tensor([7653.0]),
    'total_bedrooms': tf.convert_to_tensor([1578.0]),
    'population': tf.convert_to_tensor([3628.0]),
    'households': tf.convert_to_tensor([1494.0]),
    'median_income': tf.convert_to_tensor([3.0905]),
    'ocean_proximity': tf.convert_to_tensor(['INLAND'])
}, steps=1)


Out[35]:
array([[226396.88]], dtype=float32)

In [36]:
# Ocean_proximity is NEAR OCEAN
model.predict({
    'longitude': tf.convert_to_tensor([-122.43]),
    'latitude': tf.convert_to_tensor([37.63]),
    'housing_median_age': tf.convert_to_tensor([34.0]),
    'total_rooms': tf.convert_to_tensor([4135.0]),
    'total_bedrooms': tf.convert_to_tensor([687.0]),
    'population': tf.convert_to_tensor([2154.0]),
    'households': tf.convert_to_tensor([742.0]),
    'median_income': tf.convert_to_tensor([4.9732]),
    'ocean_proximity': tf.convert_to_tensor(['NEAR OCEAN'])
}, steps=1)


Out[36]:
array([[243165.66]], dtype=float32)

The arrays returns a predicted value. What do these numbers mean? Let's compare this value to the test set.

Go to the test.csv you read in a few cells up. Locate the first line and find the median_house_value - which should be 249,000 dollars near the ocean. What value did your model predicted for the median_house_value? Was it a solid model performance? Let's see if we can improve this a bit with feature engineering!

Engineer features to create categorical and numerical features

Now we create a cell that indicates which features will be used in the model.
Note: Be sure to bucketize 'housing_median_age' and ensure that 'ocean_proximity' is one-hot encoded. And, don't forget your numeric values!


In [37]:
# TODO 2
numeric_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                'total_bedrooms', 'population', 'households', 'median_income']

bucketized_cols = ['housing_median_age']

# indicator columns,Categorical features
categorical_cols = ['ocean_proximity']

Next, we scale the numerical, bucktized, and categorical feature columns that we assigned to the variables in the precding cell.


In [41]:
# Scalar def get_scal(feature):
def get_scal(feature):
    def minmax(x):
        mini = train[feature].min()
        maxi = train[feature].max()
        return (x - mini)/(maxi-mini)
        return(minmax)

In [42]:
# All numerical features - scaling
feature_columns = []
for header in numeric_cols:
    scal_input_fn = get_scal(header)
    feature_columns.append(fc.numeric_column(header,
                                             normalizer_fn=scal_input_fn))

Categorical Feature

In this dataset, 'ocean_proximity' is represented as a string. We cannot feed strings directly to a model. Instead, we must first map them to numeric values. The categorical vocabulary columns provide a way to represent strings as a one-hot vector.

Next, we create a categorical feature using 'ocean_proximity'.


In [43]:
# TODO 2
for feature_name in categorical_cols:
    vocabulary = housing_df[feature_name].unique()
    categorical_c = fc.categorical_column_with_vocabulary_list(feature_name, vocabulary)
    one_hot = fc.indicator_column(categorical_c)
    feature_columns.append(one_hot)

Bucketized Feature

Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. Consider our raw data that represents a homes' age. Instead of representing the house age as a numeric column, we could split the home age into several buckets using a bucketized column. Notice the one-hot values below describe which age range each row matches.

Next we create a bucketized column using 'housing_median_age'


In [44]:
# TODO 2
age = fc.numeric_column("housing_median_age")

# Bucketized cols
age_buckets = fc.bucketized_column(age, boundaries=[10, 20, 30, 40, 50, 60, 80, 100])
feature_columns.append(age_buckets)

Feature Cross

Combining features into a single feature, better known as feature crosses, enables a model to learn separate weights for each combination of features.

Next, we create a feature cross of 'housing_median_age' and 'ocean_proximity'.


In [45]:
# TODO 2
vocabulary = housing_df['ocean_proximity'].unique()
ocean_proximity = fc.categorical_column_with_vocabulary_list('ocean_proximity',
                                                             vocabulary)

crossed_feature = fc.crossed_column([age_buckets, ocean_proximity],
                                    hash_bucket_size=1000)
crossed_feature = fc.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)

Next, we should validate the total number of feature columns. Compare this number to the number of numeric features you input earlier.


In [46]:
print('Total number of feature coumns: ', len(feature_columns))


Total number of feature coumns:  11

Next, we will run this cell to compile and fit the Keras Sequential model. This is the same model we ran earlier.


In [47]:
# Model create
feature_layer = tf.keras.layers.DenseFeatures(feature_columns,
                                              dtype='float64')

model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(12, input_dim=8, activation='relu'),
  layers.Dense(8, activation='relu'),
  layers.Dense(1, activation='linear',  name='median_house_value')
])

# Model compile
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mse'])

# Model Fit
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=32)


WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4267: IndicatorColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4322: CrossedColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4322: VocabularyListCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
Train for 409 steps, validate for 103 steps
Epoch 1/32
409/409 [==============================] - 4s 9ms/step - loss: 49146764052.5340 - mse: 49144778752.0000 - val_loss: 34209257949.2039 - val_mse: 34057150464.0000
Epoch 2/32
409/409 [==============================] - 3s 7ms/step - loss: 27997332032.3571 - mse: 27993100288.0000 - val_loss: 26047439215.8447 - val_mse: 25920360448.0000
Epoch 3/32
409/409 [==============================] - 3s 7ms/step - loss: 26427177651.5119 - mse: 26428006400.0000 - val_loss: 25563897498.0971 - val_mse: 25443672064.0000
Epoch 4/32
409/409 [==============================] - 3s 7ms/step - loss: 26013434790.8618 - mse: 26013229056.0000 - val_loss: 25181936570.4078 - val_mse: 25062318080.0000
Epoch 5/32
409/409 [==============================] - 3s 7ms/step - loss: 25592102569.2986 - mse: 25592338432.0000 - val_loss: 24772717319.4563 - val_mse: 24655456256.0000
Epoch 6/32
409/409 [==============================] - 3s 7ms/step - loss: 25164181618.0027 - mse: 25166411776.0000 - val_loss: 24340632397.0485 - val_mse: 24225607680.0000
Epoch 7/32
409/409 [==============================] - 3s 7ms/step - loss: 24711333987.3497 - mse: 24720551936.0000 - val_loss: 23899074112.6214 - val_mse: 23783766016.0000
Epoch 8/32
409/409 [==============================] - 3s 7ms/step - loss: 24192186816.2675 - mse: 24207112192.0000 - val_loss: 23392034219.4951 - val_mse: 23279341568.0000
Epoch 9/32
409/409 [==============================] - 3s 7ms/step - loss: 23672812069.2911 - mse: 23670767616.0000 - val_loss: 22855638264.5437 - val_mse: 22743916544.0000
Epoch 10/32
409/409 [==============================] - 3s 7ms/step - loss: 23045424736.8609 - mse: 23054147584.0000 - val_loss: 22209141332.5049 - val_mse: 22103619584.0000
Epoch 11/32
409/409 [==============================] - 3s 7ms/step - loss: 22368786352.8642 - mse: 22377254912.0000 - val_loss: 21517526035.8835 - val_mse: 21413390336.0000
Epoch 12/32
409/409 [==============================] - 3s 7ms/step - loss: 21573610428.5529 - mse: 21568229376.0000 - val_loss: 20703399578.0971 - val_mse: 20605448192.0000
Epoch 13/32
409/409 [==============================] - 3s 7ms/step - loss: 20687236071.1799 - mse: 20693315584.0000 - val_loss: 19761445311.3786 - val_mse: 19665211392.0000
Epoch 14/32
409/409 [==============================] - 3s 7ms/step - loss: 19631878629.7214 - mse: 19623428096.0000 - val_loss: 18682552568.5437 - val_mse: 18590593024.0000
Epoch 15/32
409/409 [==============================] - 3s 7ms/step - loss: 18389918178.1530 - mse: 18388600832.0000 - val_loss: 17436753860.3495 - val_mse: 17353312256.0000
Epoch 16/32
409/409 [==============================] - 3s 7ms/step - loss: 17014257165.3783 - mse: 17015332864.0000 - val_loss: 16038545567.0680 - val_mse: 15961856000.0000
Epoch 17/32
409/409 [==============================] - 3s 7ms/step - loss: 15466389170.1447 - mse: 15468753920.0000 - val_loss: 14556398010.4078 - val_mse: 14488811520.0000
Epoch 18/32
409/409 [==============================] - 3s 7ms/step - loss: 13920044185.3261 - mse: 13923014656.0000 - val_loss: 13190283264.0000 - val_mse: 13135559680.0000
Epoch 19/32
409/409 [==============================] - 3s 7ms/step - loss: 12637915010.9097 - mse: 12635006976.0000 - val_loss: 12197742758.5243 - val_mse: 12155337728.0000
Epoch 20/32
409/409 [==============================] - 3s 7ms/step - loss: 11764854253.8132 - mse: 11761691648.0000 - val_loss: 11640998663.4563 - val_mse: 11611085824.0000
Epoch 21/32
409/409 [==============================] - 3s 7ms/step - loss: 11309559065.8943 - mse: 11311616000.0000 - val_loss: 11441522827.1845 - val_mse: 11420901376.0000
Epoch 22/32
409/409 [==============================] - 3s 7ms/step - loss: 11138298650.6274 - mse: 11134788608.0000 - val_loss: 11333174754.1748 - val_mse: 11317945344.0000
Epoch 23/32
409/409 [==============================] - 3s 7ms/step - loss: 11039242999.3815 - mse: 11041127424.0000 - val_loss: 11283897015.9223 - val_mse: 11271598080.0000
Epoch 24/32
409/409 [==============================] - 3s 7ms/step - loss: 10996809734.2405 - mse: 10993611776.0000 - val_loss: 11237824437.4369 - val_mse: 11226932224.0000
Epoch 25/32
409/409 [==============================] - 3s 7ms/step - loss: 10942176692.4016 - mse: 10939288576.0000 - val_loss: 11194542890.2524 - val_mse: 11185077248.0000
Epoch 26/32
409/409 [==============================] - 3s 7ms/step - loss: 10892755362.8816 - mse: 10893875200.0000 - val_loss: 11145629949.5146 - val_mse: 11137082368.0000
Epoch 27/32
409/409 [==============================] - 3s 7ms/step - loss: 10837225182.4292 - mse: 10840022016.0000 - val_loss: 11099079630.2913 - val_mse: 11090993152.0000
Epoch 28/32
409/409 [==============================] - 3s 7ms/step - loss: 10800527090.9747 - mse: 10802595840.0000 - val_loss: 11043985949.8252 - val_mse: 11036091392.0000
Epoch 29/32
409/409 [==============================] - 3s 7ms/step - loss: 10752542865.1962 - mse: 10752277504.0000 - val_loss: 10999730409.6311 - val_mse: 10992988160.0000
Epoch 30/32
409/409 [==============================] - 3s 7ms/step - loss: 10705624150.6556 - mse: 10705743872.0000 - val_loss: 10955175339.4951 - val_mse: 10949067776.0000
Epoch 31/32
409/409 [==============================] - 3s 7ms/step - loss: 10649445051.2510 - mse: 10654038016.0000 - val_loss: 10904979759.2233 - val_mse: 10899533824.0000
Epoch 32/32
409/409 [==============================] - 3s 7ms/step - loss: 10605813353.2160 - mse: 10603767808.0000 - val_loss: 10851134275.1068 - val_mse: 10846366720.0000

Next, we show loss and mean squared error then plot the model.


In [48]:
loss, mse = model.evaluate(train_ds)
print("Mean Squared Error", mse)


409/409 [==============================] - 2s 5ms/step - loss: 10572296459.8924 - mse: 10566608896.0000
Mean Squared Error 10566609000.0

In [49]:
plot_curves(history, ['loss', 'mse'])


Next we create a prediction model. Note: You may use the same values from the previous prediciton.


In [50]:
# TODO 2
# Median_house_value is $249,000, prediction is $234,000 NEAR OCEAN
model.predict({
    'longitude': tf.convert_to_tensor([-122.43]),
    'latitude': tf.convert_to_tensor([37.63]),
    'housing_median_age': tf.convert_to_tensor([34.0]),
    'total_rooms': tf.convert_to_tensor([4135.0]),
    'total_bedrooms': tf.convert_to_tensor([687.0]),
    'population': tf.convert_to_tensor([2154.0]),
    'households': tf.convert_to_tensor([742.0]),
    'median_income': tf.convert_to_tensor([4.9732]),
    'ocean_proximity': tf.convert_to_tensor(['NEAR OCEAN'])
}, steps=1)


Out[50]:
array([[235414.17]], dtype=float32)

Analysis

The array returns a predicted value. Compare this value to the test set you ran earlier. Your predicted value may be a bit better.

Now that you have your "feature engineering template" setup, you can experiment by creating additional features. For example, you can create derived features, such as households per population, and see how they impact the model. You can also experiment with replacing the features you used to create the feature cross.

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.