Learning Objectives
In this notebook, we use Keras to build a taxi fare prediction model and utilize feature engineering to improve the fare amount predictions for NYC taxi cab rides.
Each learning objective will correspond to a #TODO in the student lab notebook -- try to complete that notebook first before reviewing this solution notebook.
In [1]:
import datetime
import logging
import os
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow import feature_column as fc
from tensorflow.keras import layers
from tensorflow.keras import models
# set TF error log verbosity
logging.getLogger("tensorflow").setLevel(logging.ERROR)
print(tf.version.VERSION)
The Taxi Fare dataset for this lab contains 106,545 rows and has already been pre-processed and split. Note that it is the same dataset used in the BigQuery feature engineering labs. The fare_amount column is the target: the continuous value we'll train a model to predict.
First, let's download the .csv data by copying the data from a cloud storage bucket.
In [2]:
if not os.path.isdir("../data"):
os.makedirs("../data")
In [3]:
!gsutil cp gs://cloud-training-demos/feat_eng/data/*.csv ../data
Let's check that the files were copied correctly and look like we expect them to.
In [4]:
!ls -l ../data/*.csv
In [5]:
!head ../data/*.csv
Typically, you will use a two-step process to build the pipeline. Step one is to define the columns of data, i.e., which column we're predicting, and the default values. Step two is to define two functions: one that selects the features and label you want to use, and one that loads the training data. Also, note that pickup_datetime is a string, and we will need to handle this in our feature-engineered model.
In [6]:
CSV_COLUMNS = [
'fare_amount',
'pickup_datetime',
'pickup_longitude',
'pickup_latitude',
'dropoff_longitude',
'dropoff_latitude',
'passenger_count',
'key',
]
LABEL_COLUMN = 'fare_amount'
STRING_COLS = ['pickup_datetime']
NUMERIC_COLS = ['pickup_longitude', 'pickup_latitude',
'dropoff_longitude', 'dropoff_latitude',
'passenger_count']
DEFAULTS = [[0.0], ['na'], [0.0], [0.0], [0.0], [0.0], [0.0], ['na']]
DAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']  # datetime.weekday() returns 0 for Monday
In [7]:
# A function to separate the features and the label
def features_and_labels(row_data):
for unwanted_col in ['key']:
row_data.pop(unwanted_col)
label = row_data.pop(LABEL_COLUMN)
return row_data, label
# A utility function to create a tf.data dataset from CSV files matching a pattern
def load_dataset(pattern, batch_size=1, mode=tf.estimator.ModeKeys.EVAL):
dataset = tf.data.experimental.make_csv_dataset(pattern,
batch_size,
CSV_COLUMNS,
DEFAULTS)
dataset = dataset.map(features_and_labels) # features, label
if mode == tf.estimator.ModeKeys.TRAIN:
dataset = dataset.shuffle(1000).repeat()
# prefetch the next batch to overlap input processing with training
# (tf.data.experimental.AUTOTUNE can be used instead of a fixed buffer size)
dataset = dataset.prefetch(1)
return dataset
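As a quick sanity check, we can pull a single batch from the pipeline and confirm that the features and label parse as expected. This is a minimal illustrative sketch; tempds is just a throwaway name.
In [ ]:
# Inspect one batch: features is a dict of column tensors, label is fare_amount
tempds = load_dataset('../data/taxi-train*', batch_size=2)
for features, label in tempds.take(1):
    print(sorted(features.keys()))
    print(label)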
Now let's build the Deep Neural Network (DNN) model in Keras using the functional API. Unlike the sequential API, we need to specify the input and hidden layers explicitly. Note that we are creating a baseline DNN model with no feature engineering; recall that a baseline model gives us a reference point against which we can measure the benefit of the engineered features.
In [8]:
# Build a simple Keras DNN using its Functional API
def rmse(y_true, y_pred): # Root mean square error
return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))
def build_dnn_model():
# input layer
inputs = {
colname: layers.Input(name=colname, shape=(), dtype='float32')
for colname in NUMERIC_COLS
}
# feature_columns
feature_columns = {
colname: fc.numeric_column(colname)
for colname in NUMERIC_COLS
}
# Constructor for DenseFeatures takes a list of numeric columns
dnn_inputs = layers.DenseFeatures(feature_columns.values())(inputs)
# two hidden layers of [32, 8], just like in the BQML DNN
h1 = layers.Dense(32, activation='relu', name='h1')(dnn_inputs)
h2 = layers.Dense(8, activation='relu', name='h2')(h1)
# final output is a linear activation because this is regression
output = layers.Dense(1, activation='linear', name='fare')(h2)
model = models.Model(inputs, output)
# compile model
model.compile(optimizer='adam', loss='mse', metrics=[rmse, 'mse'])
return model
We'll build our DNN model and inspect the model architecture.
In [9]:
model = build_dnn_model()
tf.keras.utils.plot_model(model, 'dnn_model.png', show_shapes=False, rankdir='LR')
Out[9]:
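You can also print a text summary of the architecture, listing each layer and its parameter count, with model.summary():
In [ ]:
model.summary()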
To train the model, simply call model.fit(). Note that we should really train on many more examples (i.e., a larger NUM_TRAIN_EXAMPLES and dataset); we shouldn't draw conclusions about the quality of the model from training and evaluating it on a small sample of the full data.
We start by setting up the environment variables for training, creating the input pipeline datasets, and then train our baseline DNN model.
In [10]:
TRAIN_BATCH_SIZE = 32
NUM_TRAIN_EXAMPLES = 59621 * 5
NUM_EVALS = 5
NUM_EVAL_EXAMPLES = 14906
In [11]:
trainds = load_dataset('../data/taxi-train*',
TRAIN_BATCH_SIZE,
tf.estimator.ModeKeys.TRAIN)
evalds = load_dataset('../data/taxi-valid*',
1000,
tf.estimator.ModeKeys.EVAL).take(NUM_EVAL_EXAMPLES//1000)
steps_per_epoch = NUM_TRAIN_EXAMPLES // (TRAIN_BATCH_SIZE * NUM_EVALS)
history = model.fit(trainds,
validation_data=evalds,
epochs=NUM_EVALS,
steps_per_epoch=steps_per_epoch)
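Note that with these settings, steps_per_epoch = 298105 // (32 * 5) = 1863, so the five epochs together process about 5 * 1863 * 32, or roughly 298,000 examples, i.e., approximately one full pass over NUM_TRAIN_EXAMPLES.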
In [12]:
def plot_curves(history, metrics):
nrows = 1
ncols = 2
fig = plt.figure(figsize=(10, 5))
for idx, key in enumerate(metrics):
ax = fig.add_subplot(nrows, ncols, idx+1)
plt.plot(history.history[key])
plt.plot(history.history['val_{}'.format(key)])
plt.title('model {}'.format(key))
plt.ylabel(key)
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left');
In [13]:
plot_curves(history, ['loss', 'mse'])
To predict with Keras, you simply call model.predict() and pass in the cab ride you want to predict the fare amount for. The model then returns the predicted fare for this geolocation and pickup_datetime.
In [14]:
model.predict({
'pickup_longitude': tf.convert_to_tensor([-73.982683]),
'pickup_latitude': tf.convert_to_tensor([40.742104]),
'dropoff_longitude': tf.convert_to_tensor([-73.983766]),
'dropoff_latitude': tf.convert_to_tensor([40.755174]),
'passenger_count': tf.convert_to_tensor([3.0]),
'pickup_datetime': tf.convert_to_tensor(['2010-02-08 09:17:00 UTC'], dtype=tf.string),
}, steps=1)
Out[14]:
Next, we incorporate the temporal feature pickup_datetime. As noted earlier, pickup_datetime is a string, so we need to handle it within the model: first we include pickup_datetime as a feature, and then we modify the model to handle the string.
In [15]:
# TODO 1a
def parse_datetime(s):
if type(s) is not str:
s = s.numpy().decode('utf-8')
return datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S %Z")
# TODO 1b
def get_dayofweek(s):
ts = parse_datetime(s)
return DAYS[ts.weekday()]
# TODO 1c
@tf.function
def dayofweek(ts_in):
return tf.map_fn(
lambda s: tf.py_function(get_dayofweek, inp=[s], Tout=tf.string),
ts_in)
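As a quick sanity check (2010-02-08, the pickup date used in the prediction example above, was a Monday):
In [ ]:
dayofweek(tf.constant(['2010-02-08 09:17:00 UTC']))  # expected: [b'Mon']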
The pick-up/drop-off longitude and latitude data are crucial to predicting the fare amount, as fares in NYC taxis are largely determined by the distance traveled. As such, we want to provide the model with the Euclidean distance between the pick-up and drop-off points.
Recall that latitude and longitude allow us to specify any location on Earth as a pair of coordinates. In our training dataset, we restricted the data points to pickups and drop-offs within NYC. New York City spans an approximate longitude range of -74.05 to -73.75 and a latitude range of 40.63 to 40.85.
The dataset contains the pickup and drop-off coordinates, but no information about the distance between them. Therefore, we create a new feature that calculates the distance between each pickup/drop-off pair. We can do this using the Euclidean distance, which is the straight-line distance between any two coordinate points.
In [16]:
# TODO 2
def euclidean(params):
lon1, lat1, lon2, lat2 = params
londiff = lon2 - lon1
latdiff = lat2 - lat1
return tf.sqrt(londiff*londiff + latdiff*latdiff)
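As a quick check of the math, the sample Midtown trip from the prediction cell above works out to roughly 0.013:
In [ ]:
# Straight-line distance for the sample trip, in degrees rather than miles;
# the network can learn any rescaling it needs
euclidean([tf.constant(-73.982683), tf.constant(40.742104),
           tf.constant(-73.983766), tf.constant(40.755174)])  # ~0.0131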
It is very important for numerical variables to be scaled before they are "fed" into the neural network. Here we use min-max scaling (also called normalization) on the geolocation features. Later in our model, you will see that these values are shifted and rescaled so that they end up ranging from 0 to 1.
First, we create a function named scale_longitude that assumes longitudes fall in the range [-78, -70]. Adding 78 to each value shifts the range to [0, 8]; dividing by 8 (the width of the range) then scales it to [0, 1].
In [17]:
def scale_longitude(lon_column):
return (lon_column + 78)/8.
Next, we create a function named scale_latitude that assumes latitudes fall in the range [37, 45]. Subtracting 37 from each value shifts the range to [0, 8]; dividing by 8 (the width of the range) then scales it to [0, 1].
In [18]:
def scale_latitude(lat_column):
return (lat_column - 37)/8.
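A quick sanity check: typical Manhattan coordinates should land near the middle of the [0, 1] range.
In [ ]:
print(scale_longitude(-73.98))  # (-73.98 + 78) / 8 = 0.5025
print(scale_latitude(40.74))    # (40.74 - 37) / 8 = 0.4675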
We now assemble the feature transformations for our model in a function called transform. The transform function takes our numeric and string column features as inputs, scales the geolocation features, adds the Euclidean distance (computed by the euclidean function defined above) as a new engineered feature, and lastly bucketizes the latitude and longitude features, crosses the buckets into pickup and drop-off locations, and embeds the crossed pickup/drop-off pair.
In [19]:
def transform(inputs, numeric_cols, string_cols, nbuckets):
print("Inputs before features transformation: {}".format(inputs.keys()))
# Pass-through columns
transformed = inputs.copy()
del transformed['pickup_datetime']
feature_columns = {
colname: tf.feature_column.numeric_column(colname)
for colname in numeric_cols
}
# Scale longitude from range [-78, -70] to [0, 1]
for lon_col in ['pickup_longitude', 'dropoff_longitude']:
transformed[lon_col] = layers.Lambda(
scale_longitude,
name="scale_{}".format(lon_col))(inputs[lon_col])
# Scaling latitude from range [37, 45] to [0, 1]
for lat_col in ['pickup_latitude', 'dropoff_latitude']:
transformed[lat_col] = layers.Lambda(
scale_latitude,
name='scale_{}'.format(lat_col))(inputs[lat_col])
# TODO 2
# add Euclidean distance
transformed['euclidean'] = layers.Lambda(
euclidean,
name='euclidean')([inputs['pickup_longitude'],
inputs['pickup_latitude'],
inputs['dropoff_longitude'],
inputs['dropoff_latitude']])
feature_columns['euclidean'] = fc.numeric_column('euclidean')
# TODO 3
# create bucketized features
latbuckets = np.linspace(0, 1, nbuckets).tolist()
lonbuckets = np.linspace(0, 1, nbuckets).tolist()
b_plat = fc.bucketized_column(
feature_columns['pickup_latitude'], latbuckets)
b_dlat = fc.bucketized_column(
feature_columns['dropoff_latitude'], latbuckets)
b_plon = fc.bucketized_column(
feature_columns['pickup_longitude'], lonbuckets)
b_dlon = fc.bucketized_column(
feature_columns['dropoff_longitude'], lonbuckets)
# TODO 3
# create crossed columns
ploc = fc.crossed_column([b_plat, b_plon], nbuckets * nbuckets)
dloc = fc.crossed_column([b_dlat, b_dlon], nbuckets * nbuckets)
pd_pair = fc.crossed_column([ploc, dloc], nbuckets ** 4)
# create embedding columns
feature_columns['pickup_and_dropoff'] = fc.embedding_column(pd_pair, 100)
print("Transformed features: {}".format(transformed.keys()))
print("Feature columns: {}".format(feature_columns.keys()))
return transformed, feature_columns
Next, we'll create our DNN model, now with the engineered features. We'll set NBUCKETS = 10 to specify 10 buckets when bucketizing the latitude and longitude.
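With NBUCKETS = 10, the ploc and dloc crossed columns each hash into 10 * 10 = 100 buckets, the pickup/drop-off pair pd_pair into 10**4 = 10,000 buckets, and the embedding column then compresses those 10,000 categories into 100 dense dimensions.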
In [20]:
NBUCKETS = 10
# DNN MODEL
def rmse(y_true, y_pred):
return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))
def build_dnn_model():
# input layer is all float except for pickup_datetime which is a string
inputs = {
colname: layers.Input(name=colname, shape=(), dtype='float32')
for colname in NUMERIC_COLS
}
inputs.update({
colname: tf.keras.layers.Input(name=colname, shape=(), dtype='string')
for colname in STRING_COLS
})
# transforms
transformed, feature_columns = transform(inputs,
numeric_cols=NUMERIC_COLS,
string_cols=STRING_COLS,
nbuckets=NBUCKETS)
dnn_inputs = layers.DenseFeatures(feature_columns.values())(transformed)
# two hidden layers of [32, 8], just like in the BQML DNN
h1 = layers.Dense(32, activation='relu', name='h1')(dnn_inputs)
h2 = layers.Dense(8, activation='relu', name='h2')(h1)
# final output is a linear activation because this is regression
output = layers.Dense(1, activation='linear', name='fare')(h2)
model = models.Model(inputs, output)
# Compile model
model.compile(optimizer='adam', loss='mse', metrics=[rmse, 'mse'])
return model
In [21]:
model = build_dnn_model()
Let's see how our model architecture has changed now.
In [22]:
tf.keras.utils.plot_model(model, 'dnn_model_engineered.png', show_shapes=False, rankdir='LR')
Out[22]:
In [23]:
trainds = load_dataset('../data/taxi-train*',
TRAIN_BATCH_SIZE,
tf.estimator.ModeKeys.TRAIN)
evalds = load_dataset('../data/taxi-valid*',
1000,
tf.estimator.ModeKeys.EVAL).take(NUM_EVAL_EXAMPLES//1000)
steps_per_epoch = NUM_TRAIN_EXAMPLES // (TRAIN_BATCH_SIZE * NUM_EVALS)
history = model.fit(trainds,
validation_data=evalds,
epochs=NUM_EVALS+3,
steps_per_epoch=steps_per_epoch)
As before, let's plot the training and validation curves.
In [24]:
plot_curves(history, ['loss', 'mse'])
Let's make a prediction with this new feature-engineered model, using the same example as above.
In [25]:
model.predict({
'pickup_longitude': tf.convert_to_tensor([-73.982683]),
'pickup_latitude': tf.convert_to_tensor([40.742104]),
'dropoff_longitude': tf.convert_to_tensor([-73.983766]),
'dropoff_latitude': tf.convert_to_tensor([40.755174]),
'passenger_count': tf.convert_to_tensor([3.0]),
'pickup_datetime': tf.convert_to_tensor(['2010-02-08 09:17:00 UTC'], dtype=tf.string),
}, steps=1)
Out[25]:
Below we summarize our training results, comparing our baseline model with our feature-engineered model.
| Model | RMSE | Description |
| --- | --- | --- |
| Baseline | 12.29 | Baseline model, no feature engineering |
| Feature Engineered | 7.28 | Model with engineered features |
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.