Loading large datasets

Learning Objectives

  • Understand the difference between loading data entirely into memory and loading it in batches from disk
  • Practice loading a .csv file from disk in batches using the tf.data module

Introduction

In the previous notebook, we read the whole taxifare .csv files into memory, specifically into a Pandas dataframe, before invoking tf.data.Dataset.from_tensor_slices from the tf.data API. We could get away with this because it was a small sample of the dataset, but on the full taxifare dataset this wouldn't be feasible.

In this notebook, we demonstrate how to read .csv files directly from disk, one batch at a time, using tf.data.TextLineDataset.

Run the following cell and restart the kernel if needed


In [ ]:
import tensorflow as tf
import shutil
print(tf.__version__)

In [ ]:
tf.enable_eager_execution()

Input function reading from CSV

We define read_dataset(), which, given the path to a .csv file, returns a tf.data.Dataset in which each row represents a (features, label) tuple in the format required by the Estimator API:

  • features: A Python dictionary. Each key is a feature column name and its value is a tensor containing the data for that feature
  • label: A Tensor containing the labels

We then invoke the read_dataset() function from within train_input_fn() and eval_input_fn(). The remaining code is as before.
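For concreteness, a single element of such a dataset might look like the following (hypothetical values, shown here in eager mode):

# One (features, label) element of the dataset, with hypothetical values
features = {
    "dayofweek": tf.constant(1),
    "hourofday": tf.constant(0),
    "pickuplon": tf.constant(-74.0),
    "pickuplat": tf.constant(40.0),
    "dropofflon": tf.constant(-74.0),
    "dropofflat": tf.constant(40.7),
}
label = tf.constant(0.0)  # fare_amount, the label column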

Exercise 1

In the next cell, implement a parse_row function that takes as input a csv row (as a string) and returns a tuple (features, labels) as described above.

First, use the tf.decode_csv function to parse the fields from the csv row. Next, once fields has been parsed, create a dictionary mapping the column names to their values. Lastly, define the label and remove it from the features dict you created; this can be done in one step with Python's pop method.

The column names and the default values you'll need for these operations are given by global variables CSV_COLUMN_NAMES and CSV_DEFAULTS. The labels are stored in the first column.


In [ ]:
CSV_COLUMN_NAMES = ["fare_amount","dayofweek","hourofday","pickuplon","pickuplat","dropofflon","dropofflat"]
CSV_DEFAULTS = [[0.0],[1],[0],[-74.0], [40.0], [-74.0], [40.7]]

def parse_row(row):
    fields = # TODO: Your code goes here
    features = # TODO: Your code goes here
    labels = # TODO: Your code goes here
    return features, labels
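For reference, here is one possible implementation (a sketch using the TF 1.x tf.decode_csv API; compare it against your own solution):

def parse_row(row):
    # Parse the csv string into one tensor per column, using
    # CSV_DEFAULTS to set each column's dtype and default value.
    fields = tf.decode_csv(records=row, record_defaults=CSV_DEFAULTS)
    # Pair each column name with its parsed tensor.
    features = dict(zip(CSV_COLUMN_NAMES, fields))
    # The label, fare_amount, is the first column; pop it out of the dict.
    labels = features.pop("fare_amount")
    return features, labels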

Run the following test to make sure your implementation is correct


In [ ]:
a_row = "0.0,1,0,-74.0,40.0,-74.0,40.7"
features, labels = parse_row(a_row)

assert labels.numpy() == 0.0
assert features["pickuplon"].numpy() == -74.0
print("You rock!")

Exercise 2

Use the function parse_row you implemented in the previous exercise to implement a read_dataset function that

  • takes as input the path to a csv file
  • returns a tf.data.Dataset object whose elements are (features, labels) tuples

Assume that the .csv file has a header that read_dataset should skip. Have a look at the tf.data.TextLineDataset documentation to see what arguments to pass when initializing the dataset pipeline. Then use the parse_row function you implemented above to parse each line of the file.


In [ ]:
def read_dataset(csv_path):  
    dataset = # TODO: Your code goes here
    dataset = # TODO: Your code goes here
    return dataset
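A minimal sketch of one way to fill in the TODOs:

def read_dataset(csv_path):
    # Read the file line by line and skip the header row.
    dataset = tf.data.TextLineDataset(filenames=csv_path).skip(count=1)
    # Parse each remaining line into a (features, labels) tuple.
    dataset = dataset.map(map_func=parse_row)
    return dataset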

Tests

Let's create a test dataset to test our function.


In [ ]:
%%writefile test.csv
fare_amount,dayofweek,hourofday,pickuplon,pickuplat,dropofflon,dropofflat
28,1,0,-73.0,41.0,-74.0,20.7
12.3,1,0,-72.0,44.0,-75.0,40.6
10,1,0,-71.0,41.0,-71.0,42.9

You should be able to iterate over what's returned by read_dataset. We'll print the dropofflat and fare_amount for each entry in ./test.csv


In [ ]:
for feature, label in read_dataset("./test.csv"):
    print("dropofflat:", feature["dropofflat"].numpy())
    print("fare_amount:", label.numpy())

Run the following test cell to make sure your function works properly:


In [ ]:
dataset = read_dataset("./test.csv")
dataset_iterator = dataset.make_one_shot_iterator()
features, labels = dataset_iterator.get_next()

assert features['dayofweek'].numpy() == 1
assert labels.numpy() == 28
print("You rock!")

Exercise 3

In the code cell below, implement a train_input_fn function that

  • takes as input a path to a csv file along with a batch_size
  • returns a dataset object that shuffles the rows and returns them in batches of batch_size

Hint: Reuse the read_dataset function you implemented above.

Once you've initialized the dataset, be sure to add shuffle, repeat, and batch steps to your pipeline.


In [ ]:
def train_input_fn(csv_path, batch_size = 128):
    dataset = # TODO: Your code goes here
    dataset = # TODO: Your code goes here
    return dataset
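One possible implementation (the buffer_size of 1000 below is an assumption; any reasonably large value works):

def train_input_fn(csv_path, batch_size = 128):
    dataset = read_dataset(csv_path)
    # Shuffle within a 1000-row buffer, repeat indefinitely,
    # and emit rows in batches of batch_size.
    dataset = dataset.shuffle(buffer_size=1000).repeat(count=None).batch(batch_size=batch_size)
    return dataset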

Exercise 4

Next, implement an eval_input_fn similar to the train_input_fn you implemented above. Remember, the only difference is that this function does not need to shuffle the rows.


In [ ]:
def eval_input_fn(csv_path, batch_size = 128):
    dataset = # TODO: Your code goes here
    dataset = # TODO: Your code goes here
    return dataset
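A corresponding sketch for evaluation:

def eval_input_fn(csv_path, batch_size = 128):
    dataset = read_dataset(csv_path)
    # No shuffling (or repeating) needed for evaluation;
    # a single pass over the data in batches is enough.
    dataset = dataset.batch(batch_size=batch_size)
    return dataset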

Create feature columns

The features of our model are the following:


In [ ]:
FEATURE_NAMES = CSV_COLUMN_NAMES[1:] # all but first column
print(FEATURE_NAMES)

Exercise 5

In the cell below, create a variable called feature_cols which contains a list of the appropriate tf.feature_column objects to be passed to a tf.estimator.


In [ ]:
feature_cols = # TODO: Your code goes here
print(feature_cols)
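Since all of the taxifare features are numeric, one possible implementation is:

# One numeric column per feature name.
feature_cols = [tf.feature_column.numeric_column(key=k) for k in FEATURE_NAMES]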

Choose Estimator

Exercise 6

In the cell below, create an instance of a tf.estimator.DNNRegressor such that

  • it has two layers of 10 units each
  • it uses the features defined in the previous exercise
  • it saves the trained model into the directory ./taxi_trained
  • it has a random seed set to 1 for replicability and debugging

Have a look at the documentation for TensorFlow's DNNRegressor to remind yourself of the implementation.

Hint: Remember, the random seed is set by passing a tf.estimator.RunConfig object to the config parameter of the tf.estimator.


In [ ]:
OUTDIR = "taxi_trained"

model = # TODO: Your code goes here
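One way to fill in the TODO (a sketch; the hidden_units and seed values come from the exercise description):

model = tf.estimator.DNNRegressor(
    hidden_units=[10, 10],           # two layers of 10 units each
    feature_columns=feature_cols,
    model_dir=OUTDIR,                # save checkpoints to ./taxi_trained
    config=tf.estimator.RunConfig(tf_random_seed=1),  # fixed seed for replicability
)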

Train

Next, we'll train the model.

Exercise 7

Complete the code in the cell below to train the DNNRegressor model you instantiated above on our data. Have a look at the documentation for the train method of the DNNRegressor to see what arguments you should pass. You'll use the train_input_fn you created above and the ./taxi-train.csv dataset.

If you train your model for 500 steps, how many epochs of the dataset does this represent?


In [ ]:
%%time
tf.logging.set_verbosity(tf.logging.INFO) # so loss is printed during training
shutil.rmtree(path = OUTDIR, ignore_errors = True) # start fresh each time

model.train(
    input_fn = # TODO: Your code goes here,
    steps = # TODO: Your code goes here
)
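One way to fill in the TODOs, wrapping train_input_fn in a lambda so the Estimator can call it with no arguments:

model.train(
    input_fn = lambda: train_input_fn(csv_path="./taxi-train.csv"),
    steps = 500,
)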

Evaluate

Lastly, we'll evaluate our model.

Exercise 8

In the cell below, evaluate the model using its .evaluate method and the eval_input_fn function you implemented above on the ./taxi-valid.csv dataset. Capture the result of running evaluation in a variable called metrics. Then, extract the average_loss from the dictionary returned by model.evaluate and contained in metrics. Since average_loss is the mean squared error, its square root is the RMSE.


In [ ]:
metrics = # TODO: Your code goes here
print("RMSE on dataset = {}".format(# TODO: Your code goes here))

Challenge exercise

Create a neural network that is capable of finding the volume of a cylinder given the radius of its base (r) and its height (h). Assume that the radius and height of the cylinder are both in the range 0.5 to 2.0. Unlike in the challenge exercise for c_estimator.ipynb, assume that your measurements of r, h and V are all rounded off to the nearest 0.1. Simulate the necessary training dataset. This time, you will need a lot more data to get a good predictor.

Hint (highlight to see):

Create random values for r and h and compute V. Then, round off r, h and V (i.e., the volume is computed from the true values of r and h; it's only your measurement that is rounded off). Your dataset will consist of the rounded values of r, h and V. Do this for both the training and evaluation datasets.

Now modify the "noise" so that instead of just rounding off the value, there is up to a 10% error (uniformly distributed) in the measurement followed by rounding off.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License