In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

TensorFlow Datasets: Slicing

Notebook originally contributed by: Abhinav Prakash

This Colab notebook is a guide to the TensorFlow Datasets slicing API. We're going to see how easily we can load a dataset and perform split, slice, and merge operations.

Note: Previously there were two APIs for specifying splits, the legacy API and the S3 API. The legacy split API is now deprecated and no longer works.

First, let's import TensorFlow and TensorFlow Datasets.


In [0]:
try:
  # %tensorflow_version only works in Colab.
  %tensorflow_version 2.x
except:
  pass

In [0]:
import tensorflow as tf
import tensorflow_datasets as tfds

In [0]:
print(tf.__version__)

TensorFlow Datasets includes many popular datasets. We can get the list of all of them with the list_builders method, which returns the names of every dataset available in TensorFlow Datasets. In this notebook, I'm going to use the mnist dataset for these operations.


In [0]:
print(tfds.list_builders())

In [0]:
# Check whether the mnist dataset is present.
if "mnist" in tfds.list_builders():
  print("mnist dataset is present in tensorflow datasets")

The mnist dataset contains 70,000 records and comes with predefined splits: the train split has 60,000 records and the test split has 10,000. We can load it with the tfds.load method, passing the dataset name as a string to download it, and the list of splits we need in the split parameter.


In [0]:
train_ds, test_ds = tfds.load("mnist", split=["train","test"])

In [0]:
print("Number of Records in training set: {:,}".format(len(list(train_ds))))
print("Number of Records in test set: {:,}".format(len(list(test_ds))))

Slicing is the operation by which we construct a dataset containing a specific subset of records from the original dataset. Using the S3 API, we can do slicing operations in two ways:

  • By specifying the slice that we want with the familiar colon syntax from Python, encoded in a string literal.
  • By passing the required percentage (%) in the split parameter.

In [0]:
# By using the colon syntax
# From record 1000 (included) to record 4000 (excluded) of the `train` split.
dataset_1 = tfds.load('mnist', split='train[1000:4000]')
print("Number of Records in dataset_1 : {:,}".format(len(list(dataset_1))))

In [0]:
# The first 20% of the `train` split, selected by passing a percentage in the split string.
dataset_2 = tfds.load('mnist', split='train[:20%]')
print("Number of Records in dataset_2 : {:,}".format(len(list(dataset_2))))

Merging is the operation by which we construct a single dataset containing the records of several splits.

We can merge the train and test splits using the S3 API. The resulting dataset contains all the records from both the train and the test splits.


In [0]:
# The full dataset contains 70,000 records (60,000 train + 10,000 test).
full_dataset = tfds.load('mnist', split='train+test')
print("Number of Records in full_dataset : {:,}".format(len(list(full_dataset))))

In [0]:
# dataset_3 contains the first 20% of the 'train' split and the last 70% of the 'test' split.
dataset_3 = tfds.load('mnist', split='train[:20%]+test[-70%:]')
print("Number of Records in dataset_3 : {:,}".format(len(list(dataset_3))))

This is how we can easily load, slice, and merge datasets using TensorFlow Datasets. We can also contribute new datasets to TensorFlow Datasets to help the community, so that others don't have to spend as much time loading data. You can see the documentation for adding new datasets here: Adding a Dataset. And then send a pull request on GitHub here: tensorflow/datasets