Learning Objectives
In this lab, we use feature engineering to improve the prediction of housing prices with a Keras Sequential model.
Each learning objective will correspond to a #TODO in the student lab notebook -- try to complete that notebook first before reviewing this solution notebook.
Start by importing the necessary libraries for this lab.
In [1]:
# Install scikit-learn
!python3 -m pip install --user scikit-learn
# Ensure the right version of Tensorflow is installed.
!pip3 freeze | grep 'tensorflow==2\|tensorflow-gpu==2' || \
!python3 -m pip install --user tensorflow==2
In [40]:
import os
import tensorflow.keras
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column as fc
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import plot_model
print("TensorFlow version: ", tf.version.VERSION)
Many of the Google Machine Learning Courses Programming Exercises use the California Housing Dataset, which contains data drawn from the 1990 U.S. Census. Our lab dataset has been pre-processed so that there are no missing values.
First, let's download the raw .csv data by copying the data from a cloud storage bucket.
In [5]:
if not os.path.isdir("../data"):
    os.makedirs("../data")
In [6]:
!gsutil cp gs://cloud-training-demos/feat_eng/housing/housing_pre-proc.csv ../data
In [9]:
!ls -l ../data/
Now, let's read in the dataset just copied from the cloud storage bucket and create a Pandas dataframe.
In [11]:
housing_df = pd.read_csv('../data/housing_pre-proc.csv', error_bad_lines=False)
housing_df.head()
Out[11]:
We can use .describe() to see summary statistics for the numeric fields in our dataframe. Note the count row: it shows 20433.000000 for every feature column, so there are no missing values.
In [12]:
housing_df.describe()
Out[12]:
In [13]:
train, test = train_test_split(housing_df, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
Now, we need to output the split files. We will specifically need test.csv later for testing. You should see the files appear in the ../data directory.
In [14]:
train.to_csv('../data/housing-train.csv', encoding='utf-8', index=False)
In [15]:
val.to_csv('../data/housing-val.csv', encoding='utf-8', index=False)
In [17]:
test.to_csv('../data/housing-test.csv', encoding='utf-8', index=False)
In [18]:
!head ../data/housing*.csv
Next, we will wrap the dataframes with tf.data. This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model.
Here, we create an input pipeline using tf.data. In the student notebook this function is missing two lines; the completed function is shown below.
In [19]:
# A utility method to create a tf.data dataset from a Pandas Dataframe
# TODO 1
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('median_house_value')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
Next we initialize the training and validation datasets.
In [20]:
batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
Now that we have created the input pipeline, let's call it to see the format of the data it returns. We have used a small batch size to keep the output readable.
In [21]:
# TODO 1
for feature_batch, label_batch in train_ds.take(1):
    print('Every feature:', list(feature_batch.keys()))
    print('A batch of households:', feature_batch['households'])
    print('A batch of ocean_proximity:', feature_batch['ocean_proximity'])
    print('A batch of targets:', label_batch)
We can see that the dataset returns a dictionary of column names (from the dataframe) that map to column values from rows in the dataframe.
The output of a feature column becomes the input to the model. A numeric column is the simplest type of column. It is used to represent real-valued features. When using this column, your model will receive the column value from the dataframe unchanged.
In the California housing prices dataset, most columns from the dataframe are numeric. Let's create a variable called numeric_cols to hold only the names of the numerical feature columns.
In [22]:
# TODO 1
numeric_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
'total_bedrooms', 'population', 'households', 'median_income']
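If you want to confirm that these are indeed numeric columns in the dataframe, a quick optional check is to look at their dtypes:
In [ ]:
# Confirm that every column listed in numeric_cols has a numeric dtype.
print(housing_df[numeric_cols].dtypes)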
It is very important to scale numerical variables before they are fed into the neural network. Here we use min-max scaling. We create a function named 'get_scal' which takes a feature name and returns a 'minmax' function; this returned function is passed to tf.feature_column.numeric_column() as the normalizer_fn parameter. The 'minmax' function takes a value of that feature and returns it scaled to the range [0, 1] based on the training-set minimum and maximum.
Next, we scale the numerical feature columns that we assigned to the variable 'numeric_cols'.
In [23]:
# Scaler: returns a min-max scaling function for the given feature
# TODO 1
def get_scal(feature):
    def minmax(x):
        mini = train[feature].min()
        maxi = train[feature].max()
        return (x - mini) / (maxi - mini)
    return minmax
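As a quick illustration, the returned function maps a feature's training-set minimum to 0 and its maximum to 1 (the variable name income_scaler below is used only for this example):
In [ ]:
# Illustration: min-max scale a few 'median_income' values.
income_scaler = get_scal('median_income')
print(income_scaler(train['median_income'].min()))     # 0.0
print(income_scaler(train['median_income'].max()))     # 1.0
print(income_scaler(train['median_income'].median()))  # between 0 and 1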
In [24]:
# TODO 1
feature_columns = []
for header in numeric_cols:
    scal_input_fn = get_scal(header)
    feature_columns.append(fc.numeric_column(header,
                                             normalizer_fn=scal_input_fn))
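To see what the scaled features look like, you can optionally apply a DenseFeatures layer to a single batch from the training dataset; this demo layer is only for inspection and is not part of the model:
In [ ]:
# Apply the feature columns to one example batch to inspect the scaled values.
example_batch = next(iter(train_ds))[0]
demo_layer = tf.keras.layers.DenseFeatures(feature_columns)
print(demo_layer(example_batch).numpy()[:3])  # first three rows; values fall in [0, 1]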
Next, we should validate the total number of feature columns. Compare this number to the number of numeric features you input earlier.
In [25]:
print('Total number of feature columns: ', len(feature_columns))
In [26]:
# Model create
feature_layer = tf.keras.layers.DenseFeatures(feature_columns, dtype='float64')
model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(12, input_dim=8, activation='relu'),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='linear', name='median_house_value')
])
# Model compile
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mse'])
# Model Fit
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=32)
Next we show the loss as Mean Squared Error (MSE). Remember that MSE is the most commonly used regression loss function. MSE is the average of the squared differences between our target variable (median house value) and the predicted values.
In [27]:
loss, mse = model.evaluate(train_ds)
print("Mean Squared Error", mse)
In [29]:
def plot_curves(history, metrics):
    nrows = 1
    ncols = 2
    fig = plt.figure(figsize=(10, 5))
    for idx, key in enumerate(metrics):
        ax = fig.add_subplot(nrows, ncols, idx + 1)
        plt.plot(history.history[key])
        plt.plot(history.history['val_{}'.format(key)])
        plt.title('model {}'.format(key))
        plt.ylabel(key)
        plt.xlabel('epoch')
        plt.legend(['train', 'validation'], loc='upper left');
In [30]:
plot_curves(history, ['loss', 'mse'])
Next, we read in the test.csv file and validate that there are no null values.
Again, we can use .describe() to see some summary statistics for the numeric fields in our dataframe. The count shows 4087.000000 for all feature columns. Thus, there are no missing values.
In [31]:
test_data = pd.read_csv('../data/housing-test.csv')
test_data.describe()
Out[31]:
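If you prefer a more direct check for missing values than reading the count row, isnull().sum() reports the number of nulls per column; every entry should be zero:
In [ ]:
# Count missing values per column in the test set; every entry should be zero.
print(test_data.isnull().sum())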
Now that we have created an input pipeline using tf.data and compiled a Keras Sequential model, we create the input function for the test data and initialize the test_predict variable.
In [32]:
# TODO 1
def test_input_fn(features, batch_size=256):
    """An input function for prediction."""
    # Convert the inputs to a Dataset without labels.
    return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)
In [33]:
test_predict = test_input_fn(dict(test_data))
Before we begin to feature engineer our feature columns, we should predict the median house value. By predicting the median house value now, we can then compare it with the median house value after feature engineering.
To predict with Keras, you simply call model.predict() and pass in the housing features you want to predict the median_house_value for. Note: we are running the prediction locally.
In [34]:
predicted_median_house_value = model.predict(test_predict)
Next, we run two predictions in separate cells - one where ocean_proximity=INLAND and one where ocean_proximity=NEAR OCEAN.
In [35]:
# Ocean_proximity is INLAND
model.predict({
    'longitude': tf.convert_to_tensor([-121.86]),
    'latitude': tf.convert_to_tensor([39.78]),
    'housing_median_age': tf.convert_to_tensor([12.0]),
    'total_rooms': tf.convert_to_tensor([7653.0]),
    'total_bedrooms': tf.convert_to_tensor([1578.0]),
    'population': tf.convert_to_tensor([3628.0]),
    'households': tf.convert_to_tensor([1494.0]),
    'median_income': tf.convert_to_tensor([3.0905]),
    'ocean_proximity': tf.convert_to_tensor(['INLAND'])
}, steps=1)
Out[35]:
In [36]:
# Ocean_proximity is NEAR OCEAN
model.predict({
    'longitude': tf.convert_to_tensor([-122.43]),
    'latitude': tf.convert_to_tensor([37.63]),
    'housing_median_age': tf.convert_to_tensor([34.0]),
    'total_rooms': tf.convert_to_tensor([4135.0]),
    'total_bedrooms': tf.convert_to_tensor([687.0]),
    'population': tf.convert_to_tensor([2154.0]),
    'households': tf.convert_to_tensor([742.0]),
    'median_income': tf.convert_to_tensor([4.9732]),
    'ocean_proximity': tf.convert_to_tensor(['NEAR OCEAN'])
}, steps=1)
Out[36]:
Each array returns a predicted value. What do these numbers mean? Let's compare these values to the test set.
Go to the test.csv you read in a few cells up. Locate the first line and find the median_house_value - which should be 249,000 dollars near the ocean. What value did your model predict for the median_house_value? Was it solid model performance? Let's see if we can improve this a bit with feature engineering!
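To look up the actual value you are comparing against, you can print the first row of the test set (because train_test_split shuffles the data, the exact row you see may differ from the example described above):
In [ ]:
# Inspect the first test example and its actual median_house_value.
print(test_data.iloc[0])
print('Actual median_house_value:', test_data['median_house_value'].iloc[0])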
Now we create a cell that indicates which features will be used in the model.
Note: Be sure to bucketize 'housing_median_age' and ensure that 'ocean_proximity' is one-hot encoded. And, don't forget your numeric values!
In [37]:
# TODO 2
numeric_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
'total_bedrooms', 'population', 'households', 'median_income']
bucketized_cols = ['housing_median_age']
# Indicator columns (categorical features)
categorical_cols = ['ocean_proximity']
Next, we create the numerical, bucketized, and categorical feature columns that we assigned to the variables in the preceding cell. The numerical columns are scaled with min-max scaling, as before.
In [41]:
# Scaler: returns a min-max scaling function for the given feature
def get_scal(feature):
    def minmax(x):
        mini = train[feature].min()
        maxi = train[feature].max()
        return (x - mini) / (maxi - mini)
    return minmax
In [42]:
# All numerical features - scaling
feature_columns = []
for header in numeric_cols:
    scal_input_fn = get_scal(header)
    feature_columns.append(fc.numeric_column(header,
                                             normalizer_fn=scal_input_fn))
Next, we create a categorical feature using 'ocean_proximity'.
In [43]:
# TODO 2
for feature_name in categorical_cols:
    vocabulary = housing_df[feature_name].unique()
    categorical_c = fc.categorical_column_with_vocabulary_list(feature_name, vocabulary)
    one_hot = fc.indicator_column(categorical_c)
    feature_columns.append(one_hot)
Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. Consider our raw data that represents a home's age. Instead of representing the house age as a numeric column, we could split the age into several buckets using a bucketized column. The one-hot values a bucketized column produces describe which age range each row falls into.
Next, we create a bucketized column using 'housing_median_age'.
In [44]:
# TODO 2
age = fc.numeric_column("housing_median_age")
# Bucketized cols
age_buckets = fc.bucketized_column(age, boundaries=[10, 20, 30, 40, 50, 60, 80, 100])
feature_columns.append(age_buckets)
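To see the one-hot encoding that a bucketized column produces, you can optionally run it through a DenseFeatures layer on a demo batch; this is only for inspection and is not part of the model:
In [ ]:
# Demo: show which age bucket each example in a batch falls into.
demo_batch = next(iter(train_ds))[0]
age_demo_layer = tf.keras.layers.DenseFeatures([age_buckets])
print(age_demo_layer(demo_batch).numpy()[:5])  # one-hot bucket membership for 5 rows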
Combining features into a single feature, better known as feature crosses, enables a model to learn separate weights for each combination of features.
Next, we create a feature cross of 'housing_median_age' and 'ocean_proximity'.
In [45]:
# TODO 2
vocabulary = housing_df['ocean_proximity'].unique()
ocean_proximity = fc.categorical_column_with_vocabulary_list('ocean_proximity',
                                                             vocabulary)
crossed_feature = fc.crossed_column([age_buckets, ocean_proximity],
                                    hash_bucket_size=1000)
crossed_feature = fc.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)
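As a rough sanity check on hash_bucket_size, you can count how many distinct (age bucket, ocean_proximity) combinations the cross can actually produce; with 8 boundaries there are 9 age buckets, so the count stays well below 1000:
In [ ]:
# The number of age buckets is one more than the number of boundaries.
num_age_buckets = len(age_buckets.boundaries) + 1
print('Possible crossed categories:', num_age_buckets * len(vocabulary))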
Next, we should validate the total number of feature columns. Compare this number to the number of numeric features you input earlier.
In [46]:
print('Total number of feature columns: ', len(feature_columns))
Next, we will run this cell to compile and fit the Keras Sequential model. This is the same model we ran earlier.
In [47]:
# Model create
feature_layer = tf.keras.layers.DenseFeatures(feature_columns,
                                              dtype='float64')
model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(12, input_dim=8, activation='relu'),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='linear', name='median_house_value')
])
# Model compile
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mse'])
# Model Fit
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=32)
Next, we show the loss and mean squared error, then plot the training and validation curves.
In [48]:
loss, mse = model.evaluate(train_ds)
print("Mean Squared Error", mse)
In [49]:
plot_curves(history, ['loss', 'mse'])
Next, we make a prediction with the newly trained model. Note: You may use the same values from the previous prediction.
In [50]:
# TODO 2
# Median_house_value is $249,000, prediction is $234,000 NEAR OCEAN
model.predict({
    'longitude': tf.convert_to_tensor([-122.43]),
    'latitude': tf.convert_to_tensor([37.63]),
    'housing_median_age': tf.convert_to_tensor([34.0]),
    'total_rooms': tf.convert_to_tensor([4135.0]),
    'total_bedrooms': tf.convert_to_tensor([687.0]),
    'population': tf.convert_to_tensor([2154.0]),
    'households': tf.convert_to_tensor([742.0]),
    'median_income': tf.convert_to_tensor([4.9732]),
    'ocean_proximity': tf.convert_to_tensor(['NEAR OCEAN'])
}, steps=1)
Out[50]:
The array returns a predicted value. Compare this value to the test set you ran earlier. Your predicted value may be a bit better.
Now that you have your "feature engineering template" setup, you can experiment by creating additional features. For example, you can create derived features, such as households per population, and see how they impact the model. You can also experiment with replacing the features you used to create the feature cross.
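As a sketch of one such experiment (the column name households_per_population is only an illustration; after adding it you would need to re-split the data and rebuild the feature columns):
In [ ]:
# Sketch: add a derived ratio feature to the dataframe, then treat it like
# any other numeric column (append it to numeric_cols and scale it with get_scal).
housing_df['households_per_population'] = (
    housing_df['households'] / housing_df['population'])
print(housing_df['households_per_population'].describe())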
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.