LAB 03: Basic Feature Engineering in Keras

Learning Objectives

  1. Create an input pipeline using tf.data
  2. Engineer features to create categorical, crossed, and numerical feature columns

Introduction

In this lab, we utilize feature engineering to improve the prediction of housing prices using a Keras Sequential Model.

Each learning objective corresponds to a #TODO in the notebook, where you complete the cell's code before running it. Refer to the solution for reference.

Start by importing the necessary libraries for this lab.


In [ ]:
# Install scikit-learn (the package name on PyPI is scikit-learn, not sklearn)
!python3 -m pip install --user scikit-learn

# Ensure the right version of TensorFlow is installed.
!pip3 freeze | grep 'tensorflow==2\|tensorflow-gpu==2' || \
!python3 -m pip install --user tensorflow==2

In [ ]:
import os

import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf

from tensorflow import feature_column as fc
from tensorflow.keras import layers
from tensorflow.keras.utils import plot_model
from sklearn.model_selection import train_test_split

print("TensorFlow version: ", tf.version.VERSION)

Many of the Google Machine Learning Courses Programming Exercises use the California Housing Dataset, which contains data drawn from the 1990 U.S. Census. Our lab dataset has been pre-processed so that there are no missing values.

First, let's download the raw .csv data by copying the data from a cloud storage bucket.


In [ ]:
if not os.path.isdir("../data"):
    os.makedirs("../data")

In [ ]:
!gsutil cp gs://cloud-training-demos/feat_eng/housing/housing_pre-proc.csv ../data

In [ ]:
!ls -l ../data/

Now, let's read in the dataset just copied from the cloud storage bucket and create a Pandas dataframe.


In [ ]:
housing_df = pd.read_csv('../data/housing_pre-proc.csv')
housing_df.head()

We can use .describe() to see summary statistics for the numeric fields in our dataframe. Note the count row: it shows 20433.000000 for every feature column, so there are no missing values.


In [ ]:
housing_df.describe()

Split the dataset for ML

The dataset we loaded was a single CSV file. We will split this into train, validation, and test sets.


In [ ]:
train, test = train_test_split(housing_df, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)

print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

Now, we need to write out the split files. We will specifically need housing-test.csv later for testing. You should see the files appear in the ../data directory.


In [ ]:
train.to_csv('../data/housing-train.csv', encoding='utf-8', index=False)

In [ ]:
val.to_csv('../data/housing-val.csv', encoding='utf-8', index=False)

In [ ]:
test.to_csv('../data/housing-test.csv', encoding='utf-8', index=False)

In [ ]:
!head ../data/housing*.csv

Create an input pipeline using tf.data

Next, we will wrap the dataframes with tf.data. This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model.

Here, we create an input pipeline using tf.data. The function below is missing two lines; complete them and run the cell.


In [ ]:
# A utility method to create a tf.data dataset from a Pandas Dataframe

def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    
    # TODO 1 -- Your code here

    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
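
For reference, one possible completion of the cell above, as a sketch. It assumes median_house_value is the label column (matching the name of the model's output layer later in this lab) and follows the standard tf.data pattern for Pandas dataframes:

# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()

    # Separate the label and build a dataset of (features, label) pairs.
    labels = dataframe.pop('median_house_value')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))

    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds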

Next we initialize the training and validation datasets.


In [ ]:
batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)

Now that we have created the input pipeline, let's call it to see the format of the data it returns. We have used a small batch size to keep the output readable.


In [ ]:
# TODO 1
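
One possible completion, as a sketch; the batch size of 5 and the households column are chosen here only to keep the output readable:

# Build a small-batch dataset and inspect one batch of it.
small_ds = df_to_dataset(train, batch_size=5)

for feature_batch, label_batch in small_ds.take(1):
    print('Every feature:', list(feature_batch.keys()))
    print('A batch of households:', feature_batch['households'])
    print('A batch of labels:', label_batch)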

We can see that the dataset returns a dictionary of column names (from the dataframe) that map to column values from rows in the dataframe.

Numeric columns

The output of a feature column becomes the input to the model. A numeric column is the simplest type of column. It is used to represent real-valued features. When using this column, your model receives the column value from the dataframe unchanged.

In the California housing prices dataset, most columns from the dataframe are numeric. Let's create a variable called numeric_cols to hold only the numerical feature columns.


In [ ]:
# TODO 1
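
A sketch of the list, using the eight numeric column names from the dataframe (every column except the ocean_proximity string column and the median_house_value label):

numeric_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                'total_bedrooms', 'population', 'households', 'median_income']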

Scaler function

It is very important to scale numerical variables before they are fed into the neural network. Here we use min-max scaling. We create a function named get_scal which takes the name of a numerical feature and returns a minmax function; the minmax function is passed to tf.feature_column.numeric_column() as the normalizer_fn parameter. The minmax function itself takes a value of that feature and returns the scaled value.

Next, we scale the numerical feature columns that we assigned to the variable numeric_cols.


In [ ]:
# Scaler: define get_scal(feature)
# TODO 1

In [ ]:
# TODO 1
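
For reference, the completed versions of these two cells appear later in this lab; as a sketch:

# Scaler: get_scal(feature) returns a min-max normalizer for that feature
def get_scal(feature):
    def minmax(x):
        mini = train[feature].min()
        maxi = train[feature].max()
        return (x - mini)/(maxi - mini)
    return minmax

# All numerical features - scaling
feature_columns = []
for header in numeric_cols:
    scal_input_fn = get_scal(header)
    feature_columns.append(fc.numeric_column(header,
                                             normalizer_fn=scal_input_fn))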

Next, we should validate the total number of feature columns. Compare this number to the number of numeric features you input earlier.


In [ ]:
print('Total number of feature columns: ', len(feature_columns))

Using the Keras Sequential Model

Next, we will run this cell to compile and fit the Keras Sequential model.


In [ ]:
# Model create
feature_layer = tf.keras.layers.DenseFeatures(feature_columns, dtype='float64')

model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(12, activation='relu'),
  layers.Dense(8, activation='relu'),
  layers.Dense(1, activation='linear', name='median_house_value')
])

# Model compile
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mse'])

# Model Fit
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=32)

Next we show the loss as mean squared error (MSE). Remember that MSE is the most commonly used regression loss function: it is the average of the squared differences between the target variable (here, median house value) and the predicted values.


In [ ]:
loss, mse = model.evaluate(train_ds)
print("Mean Squared Error", mse)

Visualize the model loss curve

Next, we will use matplotlib to draw the model's loss curves for training and validation: a line plot of the mean squared error loss over the training epochs for both the train (blue) and validation (orange) sets.


In [ ]:
def plot_curves(history, metrics):
    nrows = 1
    ncols = 2
    fig = plt.figure(figsize=(10, 5))

    for idx, key in enumerate(metrics):  
        ax = fig.add_subplot(nrows, ncols, idx+1)
        plt.plot(history.history[key])
        plt.plot(history.history['val_{}'.format(key)])
        plt.title('model {}'.format(key))
        plt.ylabel(key)
        plt.xlabel('epoch')
        plt.legend(['train', 'validation'], loc='upper left');

In [ ]:
plot_curves(history, ['loss', 'mse'])

Load test data

Next, we read in the test.csv file and validate that there are no null values.

Again, we can use .describe() to see some summary statistics for the numeric fields in our dataframe. The count shows 4087.000000 for all feature columns. Thus, there are no missing values.


In [ ]:
test_data = pd.read_csv('../data/housing-test.csv')
test_data.describe()

Now that we have created an input pipeline using tf.data and compiled a Keras Sequential Model, we create an input function for the test data and initialize the test_predict variable.


In [ ]:
# TODO 1
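
A minimal sketch of the input function, assuming prediction only needs the batched feature dictionary with no labels; the batch size of 256 is illustrative:

def test_input_fn(features, batch_size=256):
    """An input function for prediction: features only, no labels."""
    return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)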

In [ ]:
test_predict = test_input_fn(dict(test_data))

Prediction: Linear Regression

Before we feature engineer our feature columns, we should predict the median house value. By predicting it now, we can compare the result with the prediction after feature engineering.

To predict with Keras, you simply call model.predict() and pass in the housing features you want the median_house_value predicted for. Note: we are running the prediction locally.


In [ ]:
predicted_median_house_value = model.predict(test_predict)

Next, we run two predictions in separate cells: one where ocean_proximity=INLAND and one where ocean_proximity=NEAR OCEAN.


In [ ]:
# Ocean_proximity is INLAND
model.predict({
    'longitude': tf.convert_to_tensor([-121.86]),
    'latitude': tf.convert_to_tensor([39.78]),
    'housing_median_age': tf.convert_to_tensor([12.0]),
    'total_rooms': tf.convert_to_tensor([7653.0]),
    'total_bedrooms': tf.convert_to_tensor([1578.0]),
    'population': tf.convert_to_tensor([3628.0]),
    'households': tf.convert_to_tensor([1494.0]),
    'median_income': tf.convert_to_tensor([3.0905]),
    'ocean_proximity': tf.convert_to_tensor(['INLAND'])
}, steps=1)

In [ ]:
# Ocean_proximity is NEAR OCEAN
model.predict({
    'longitude': tf.convert_to_tensor([-122.43]),
    'latitude': tf.convert_to_tensor([37.63]),
    'housing_median_age': tf.convert_to_tensor([34.0]),
    'total_rooms': tf.convert_to_tensor([4135.0]),
    'total_bedrooms': tf.convert_to_tensor([687.0]),
    'population': tf.convert_to_tensor([2154.0]),
    'households': tf.convert_to_tensor([742.0]),
    'median_income': tf.convert_to_tensor([4.9732]),
    'ocean_proximity': tf.convert_to_tensor(['NEAR OCEAN'])
}, steps=1)

Each array returns a predicted value. What do these numbers mean? Let's compare them to the test set.

Go back to the test data you read in a few cells up. Locate the first line and find the median_house_value, which should be 249,000 dollars near the ocean. What value did your model predict for the median_house_value? Was the performance solid? Let's see if we can improve it a bit with feature engineering!

Engineer features to create categorical and numerical features

Now we create a cell that indicates which features will be used in the model.
Note: Be sure to bucketize 'housing_median_age' and ensure that 'ocean_proximity' is one-hot encoded. And, don't forget your numeric values!


In [ ]:
# TODO 2
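
One plausible grouping of the features, as a sketch (the exact lists are an assumption; housing_median_age is set aside to be bucketized and ocean_proximity to be one-hot encoded):

# Numeric columns to be min-max scaled.
numeric_cols = ['longitude', 'latitude', 'total_rooms', 'total_bedrooms',
                'population', 'households', 'median_income']

# Column to be bucketized.
bucketized_cols = ['housing_median_age']

# String column to be one-hot encoded.
categorical_cols = ['ocean_proximity']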

Next, we scale the numerical, bucketized, and categorical feature columns that we assigned to the variables in the preceding cell.


In [ ]:
# Scaler: get_scal(feature) returns a min-max normalizer for that feature
def get_scal(feature):
    def minmax(x):
        mini = train[feature].min()
        maxi = train[feature].max()
        return (x - mini)/(maxi - mini)
    return minmax

In [ ]:
# All numerical features - scaling
feature_columns = []
for header in numeric_cols:
    scal_input_fn = get_scal(header)
    feature_columns.append(fc.numeric_column(header,
                                             normalizer_fn=scal_input_fn))

Categorical Feature

In this dataset, 'ocean_proximity' is represented as a string. We cannot feed strings directly to a model. Instead, we must first map them to numeric values. The categorical vocabulary columns provide a way to represent strings as a one-hot vector.

Next, we create a categorical feature using 'ocean_proximity'.


In [ ]:
# TODO 2
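
A sketch of the categorical column; the vocabulary list below assumes the five ocean_proximity values found in the California housing data:

# Categorical vocabulary column, wrapped in an indicator (one-hot) column.
ocean = fc.categorical_column_with_vocabulary_list(
    'ocean_proximity',
    ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'])
ocean_one_hot = fc.indicator_column(ocean)
feature_columns.append(ocean_one_hot)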

Bucketized Feature

Often, you don't want to feed a number directly into the model, but instead split its value into categories based on numerical ranges. Consider our raw data that represents a home's age. Instead of representing the house age as a numeric column, we could split the age into several buckets using a bucketized column. Notice that the one-hot values below describe which age range each row matches.

Next, we create a bucketized column using 'housing_median_age'.


In [ ]:
# TODO 2
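
A sketch; the bucket boundaries below are illustrative, not prescribed by the lab:

# Bucketize housing_median_age into one-hot encoded age ranges.
age = fc.numeric_column('housing_median_age')
age_buckets = fc.bucketized_column(age, boundaries=[10, 20, 30, 40, 50])
feature_columns.append(age_buckets)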

Feature Cross

Combining features into a single feature, better known as a feature cross, enables a model to learn separate weights for each combination of features.

Next, we create a feature cross of 'housing_median_age' and 'ocean_proximity'.


In [ ]:
# TODO 2
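
A sketch that crosses the bucketized age column with the ocean_proximity column from the sketches above; the hash_bucket_size of 1000 is an assumption:

# Feature cross of bucketized age and ocean proximity, one-hot encoded.
crossed_age_ocean = fc.crossed_column([age_buckets, ocean],
                                      hash_bucket_size=1000)
feature_columns.append(fc.indicator_column(crossed_age_ocean))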

Next, we should validate the total number of feature columns. Compare this number to the number of numeric features you input earlier.


In [ ]:
print('Total number of feature columns: ', len(feature_columns))

Next, we will run this cell to compile and fit the Keras Sequential model. This is the same model we ran earlier.


In [ ]:
# Model create
feature_layer = tf.keras.layers.DenseFeatures(feature_columns,
                                              dtype='float64')

model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(12, activation='relu'),
  layers.Dense(8, activation='relu'),
  layers.Dense(1, activation='linear', name='median_house_value')
])

# Model compile
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mse'])

# Model Fit
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=32)

Next, we show the loss and mean squared error, then plot the model's loss curves.


In [ ]:
loss, mse = model.evaluate(train_ds)
print("Mean Squared Error", mse)

In [ ]:
plot_curves(history, ['loss', 'mse'])

Next, we make a prediction with the new model. Note: you may use the same values from the previous prediction.


In [ ]:
# TODO 2
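
As a sketch, the NEAR OCEAN example values from earlier can be fed to the re-trained model:

# Ocean_proximity is NEAR OCEAN
model.predict({
    'longitude': tf.convert_to_tensor([-122.43]),
    'latitude': tf.convert_to_tensor([37.63]),
    'housing_median_age': tf.convert_to_tensor([34.0]),
    'total_rooms': tf.convert_to_tensor([4135.0]),
    'total_bedrooms': tf.convert_to_tensor([687.0]),
    'population': tf.convert_to_tensor([2154.0]),
    'households': tf.convert_to_tensor([742.0]),
    'median_income': tf.convert_to_tensor([4.9732]),
    'ocean_proximity': tf.convert_to_tensor(['NEAR OCEAN'])
}, steps=1)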

Analysis

The array returns a predicted value. Compare this value to the test set result from earlier; your predicted value may be a bit better.

Now that you have your "feature engineering template" set up, you can experiment with additional features. For example, you can create derived features, such as households per population, and see how they impact the model. You can also experiment with replacing the features used to create the feature cross.

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.