In this notebook, we will create a machine learning model using tf.estimator and evaluate its performance. The dataset is rather small (7700 samples), so we can do it all in-memory. We will also simply pass the raw data in as-is.
In [ ]:
# Ensure the right version of Tensorflow is installed.
!pip freeze | grep tensorflow==2.1
In [ ]:
import tensorflow as tf
import pandas as pd
import numpy as np
import shutil
print(tf.__version__)
Read data created in the previous chapter.
In [ ]:
# In CSV, label is the first column, after the features, followed by the key
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
FEATURES = CSV_COLUMNS[1:len(CSV_COLUMNS) - 1]
LABEL = CSV_COLUMNS[0]
df_train = pd.read_csv('./taxi-train.csv', header = None, names = CSV_COLUMNS)
df_valid = pd.read_csv('./taxi-valid.csv', header = None, names = CSV_COLUMNS)
In [ ]:
def make_input_fn(df, num_epochs):
return tf.compat.v1.estimator.inputs.pandas_input_fn(
x = df,
y = df[LABEL],
batch_size = 128,
num_epochs = num_epochs,
shuffle = True,
queue_capacity = 1000,
num_threads = 1
)
In [ ]:
def make_feature_cols():
input_columns = [tf.feature_column.numeric_column(k) for k in FEATURES]
return input_columns
In [ ]:
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
model = tf.estimator.LinearRegressor(
feature_columns = make_feature_cols(), model_dir = OUTDIR)
model.train(input_fn = make_input_fn(df_train, num_epochs = 10))
Evaluate on the validation data (we should defer using the test data to after we have selected a final model).
In [ ]:
def print_rmse(model, name, df):
metrics = model.evaluate(input_fn = make_input_fn(df, 1))
print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['average_loss'])))
print_rmse(model, 'validation', df_valid)
This is nowhere near our benchmark (RMSE of $6 or so on this data), but it serves to demonstrate what TensorFlow code looks like. Let's use this model for prediction.
In [ ]:
import itertools
# Read saved model and use it for prediction
model = tf.estimator.LinearRegressor(
feature_columns = make_feature_cols(), model_dir = OUTDIR)
preds_iter = model.predict(input_fn = make_input_fn(df_valid, 1))
print([pred['predictions'][0] for pred in list(itertools.islice(preds_iter, 5))])
This explains why the RMSE was so high -- the model essentially predicts the same amount for every trip. Would a more complex model help? Let's try using a deep neural network. The code to do this is quite straightforward as well.
In [ ]:
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
model = tf.estimator.DNNRegressor(hidden_units = [32, 8, 2],
feature_columns = make_feature_cols(), model_dir = OUTDIR)
model.train(input_fn = make_input_fn(df_train, num_epochs = 100));
print_rmse(model, 'validation', df_valid)
We are not beating our benchmark with either model ... what's up? Well, we may be using TensorFlow for Machine Learning, but we are not yet using it well. That's what the rest of this course is about!
But, for the record, let's say we had to choose between the two models. We'd choose the one with the lower validation error. Finally, we'd measure the RMSE on the test data with this chosen model.
Let's do this on the benchmark dataset.
In [ ]:
from google.cloud import bigquery
import numpy as np
import pandas as pd
def create_query(phase, EVERY_N):
"""Creates a query with the proper splits.
Args:
phase: int, 1=train, 2=valid.
EVERY_N: int, take an example EVERY_N rows.
Returns:
Query string with the proper splits.
"""
base_query = """
WITH daynames AS
(SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek)
SELECT
(tolls_amount + fare_amount) AS fare_amount,
daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
pickup_longitude AS pickuplon,
pickup_latitude AS pickuplat,
dropoff_longitude AS dropofflon,
dropoff_latitude AS dropofflat,
passenger_count AS passengers,
'notneeded' AS key
FROM
`nyc-tlc.yellow.trips`, daynames
WHERE
trip_distance > 0 AND fare_amount > 0
"""
if EVERY_N is None:
if phase < 2:
# training
query = """{0} AND ABS(MOD(FARM_FINGERPRINT(CAST
(pickup_datetime AS STRING), 4)) < 2""".format(base_query)
else:
query = """{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(
pickup_datetime AS STRING), 4)) = {1}""".format(base_query, phase)
else:
query = """{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(
pickup_datetime AS STRING)), {1})) = {2}""".format(
base_query, EVERY_N, phase)
return query
query = create_query(2, 100000)
df = bigquery.Client().query(query).to_dataframe()
In [ ]:
print_rmse(model, 'benchmark', df)
RMSE on benchmark dataset is 10.63 (your results will vary because of random seeds).
This is not only way more than our original benchmark of 6.00, but it doesn't even beat our distance-based rule's RMSE of 8.02.
Fear not -- you have learned how to write a TensorFlow model, but not to do all the things that you will have to do to your ML model performant. We will do this in the next chapters. In this chapter though, we will get our TensorFlow model ready for these improvements.
In a software sense, the rest of the labs in this chapter will be about refactoring the code so that we can improve it.
Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License