Learning Objectives
Read a .csv file from disk in batches using the tf.data module

In the previous notebook, we read the whole taxifare .csv files into memory, specifically into a Pandas dataframe, before invoking tf.data.Dataset.from_tensor_slices from the tf.data API. We could get away with this because it was a small sample of the dataset, but on the full taxifare dataset this wouldn't be feasible.
In this notebook, we demonstrate how to read .csv files directly from disk, one batch at a time, using tf.data.TextLineDataset; a small sketch of the idea appears right after the setup cells below.
Run the following cell and restart the kernel if needed:
In [ ]:
import tensorflow as tf
import shutil
print(tf.__version__)
In [ ]:
tf.enable_eager_execution()
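To make the idea concrete, here is a minimal sketch of how tf.data.TextLineDataset streams a file from disk one line at a time instead of loading it all into memory. The file name data.csv is hypothetical, for illustration only; any .csv file on disk would do.

lines = tf.data.TextLineDataset(filenames = "data.csv")  # hypothetical file name
for line in lines.take(2):  # pulls only the first two lines from disk
    print(line.numpy())     # each element is a scalar tf.string tensor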
We define read_dataset(), which, given a csv file path, returns a tf.data.Dataset in which each row represents a (features, label) tuple in the format required by the Estimator API.
We then invoke the read_dataset() function from within train_input_fn() and eval_input_fn(). The remaining code is as before.
In [ ]:
CSV_COLUMN_NAMES = ["fare_amount","dayofweek","hourofday","pickuplon","pickuplat","dropofflon","dropofflat"]
CSV_DEFAULTS = [[0.0],[1],[0],[-74.0], [40.0], [-74.0], [40.7]]
def parse_row(row):
    fields = tf.decode_csv(records = row, record_defaults = CSV_DEFAULTS)
    features = dict(zip(CSV_COLUMN_NAMES, fields))
    label = features.pop("fare_amount") # remove label from features and store
    return features, label
Run the following test to make sure your implementation is correct:
In [ ]:
a_row = "0.0,1,0,-74.0,40.0,-74.0,40.7"
features, labels = parse_row(a_row)
assert labels.numpy() == 0.0
assert features["pickuplon"].numpy() == -74.0
print("You rock!")
We'll use the parse_row function we implemented above to implement a read_dataset function that takes as input the path to a csv file and returns a tf.data.Dataset object containing the features and labels. We can assume that the .csv file has a header, and that your read_dataset will skip it.
In [ ]:
def read_dataset(csv_path):
    dataset = tf.data.TextLineDataset(filenames = csv_path).skip(count = 1) # skip header
    dataset = dataset.map(map_func = parse_row)
    return dataset
Let's create a small .csv file to test our function.
In [ ]:
%%writefile test.csv
fare_amount,dayofweek,hourofday,pickuplon,pickuplat,dropofflon,dropofflat
28,1,0,-73.0,41.0,-74.0,20.7
12.3,1,0,-72.0,44.0,-75.0,40.6
10,1,0,-71.0,41.0,-71.0,42.9
You should be able to iterate over what's returned by read_dataset. We'll print the dropofflat and fare_amount for each entry in ./test.csv.
In [ ]:
for feature, label in read_dataset("./test.csv"):
    print("dropofflat:", feature["dropofflat"].numpy())
    print("fare_amount:", label.numpy())
Run the following test cell to make sure your function works properly:
In [ ]:
dataset = read_dataset("./test.csv")
dataset_iterator = dataset.make_one_shot_iterator()
features, labels = dataset_iterator.get_next()
assert features["dayofweek"].numpy() == 1
assert labels.numpy() == 28
print("You rock!")
Next we can implement a train_input_fn function that takes as input the path to a csv file and a batch_size, and returns a tf.data.Dataset that is shuffled, repeated indefinitely, and batched by batch_size. We'll reuse the read_dataset function you implemented above.
In [ ]:
def train_input_fn(csv_path, batch_size = 128):
    dataset = read_dataset(csv_path)
    dataset = dataset.shuffle(buffer_size = 1000).repeat(count = None).batch(batch_size = batch_size)
    return dataset
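As a quick sanity check (using the test.csv we created above; batch_size = 2 is an arbitrary choice), we can pull a single batch and confirm that rows get stacked together. Because repeat(count = None) cycles through the data indefinitely, this input function never runs out of batches during training.

batch_features, batch_labels = train_input_fn("./test.csv", batch_size = 2).make_one_shot_iterator().get_next()
print(batch_labels.numpy().shape)           # (2,): one label per row in the batch
print(batch_features["hourofday"].numpy())  # a vector of 2 values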
Next, we implement an eval_input_fn similar to the train_input_fn you implemented above.
The differences are that this function does not need to shuffle the rows, and it makes only a single pass over the data (no repeat).
In [ ]:
def eval_input_fn(csv_path, batch_size = 128):
    dataset = read_dataset(csv_path)
    dataset = dataset.batch(batch_size = batch_size)
    return dataset
The features of our model are the following:
In [ ]:
FEATURE_NAMES = CSV_COLUMN_NAMES[1:] # all but first column
print(FEATURE_NAMES)
In the cell below, create a variable feature_cols containing a list of the appropriate tf.feature_column to be passed to a tf.estimator:
In [ ]:
feature_cols = [tf.feature_column.numeric_column(key = k) for k in FEATURE_NAMES]
print(feature_cols)
Next, we create an instance of a tf.estimator.DNNRegressor with two hidden layers of 10 units each, using the feature columns defined above and saving checkpoints into the directory ./taxi_trained. Note that we can set the random seed by passing a tf.estimator.RunConfig object to the config parameter of the tf.estimator.
In [ ]:
OUTDIR = "taxi_trained"
model = tf.estimator.DNNRegressor(
hidden_units = [10,10], # specify neural architecture
feature_columns = feature_cols,
model_dir = OUTDIR,
config = tf.estimator.RunConfig(tf_random_seed = 1)
)
With the model defined, we can now train the model on our data. In the cell below, we train the model you defined above using train_input_fn on ./taxi-train.csv for 500 steps. How many epochs of our data does this represent? (Each step processes one batch of 128 examples, so 500 steps cover 500 × 128 = 64,000 examples; divide that by the number of rows in taxi-train.csv to get the number of epochs.)
In [ ]:
%%time
tf.logging.set_verbosity(tf.logging.INFO) # so loss is printed during training
shutil.rmtree(path = OUTDIR, ignore_errors = True) # start fresh each time
model.train(
    input_fn = lambda: train_input_fn(csv_path = "./taxi-train.csv"),
    steps = 500
)
Finally, we'll evaluate the performance of our model on the validation set. We evaluate the model using its .evaluate method and the eval_input_fn function you implemented above on the ./taxi-valid.csv dataset. Note that we extract the average_loss from the dictionary returned by model.evaluate; average_loss is the mean squared error, so its square root gives the RMSE.
In [ ]:
metrics = model.evaluate(input_fn = lambda: eval_input_fn(csv_path = "./taxi-valid.csv"))
print("RMSE on dataset = {}".format(metrics["average_loss"]**.5))
Challenge Exercise

Create a neural network that is capable of finding the volume of a cylinder given the radius of its base (r) and its height (h). Assume that the radius and height of the cylinder are both in the range 0.5 to 2.0. Unlike in the challenge exercise for b_estimator.ipynb, assume that your measurements of r, h and V are all rounded off to the nearest 0.1. Simulate the necessary training dataset. This time, you will need a lot more data to get a good predictor.
Hint:
Create random values for r and h and compute V. Then, round off r, h and V (i.e., the volume is computed from the true value of r and h; it's only your measurement that is rounded off). Your dataset will consist of the round values of r, h and V. Do this for both the training and evaluation datasets.
Now modify the "noise" so that instead of just rounding off the value, there is up to a 10% error (uniformly distributed) in the measurement followed by rounding off.
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.