Coursera

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Exploring the Splits API

Setup

We’ll start by importing TensorFlow and TensorFlow Datasets.

try:
    %tensorflow_version 2.x
except:
    pass

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.

import tensorflow as tf
import tensorflow_datasets as tfds

print("\u2022 Using TensorFlow Version:", tf.__version__)

• Using TensorFlow Version: 2.14.0

Exploring the Splits API

train_ds, test_ds = tfds.load('mnist:3.*.*', split=['train', 'test'])

print(len(list(train_ds)))
print(len(list(test_ds)))

Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/tensorflow_datasets/mnist/3.0.1...



Dl Completed...:   0%|          | 0/5 [00:00<?, ? file/s]


Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.
60000
10000

With the slicing API we can use strings to specify the slicing instructions. For example, in the cell below we will merge the training and test sets by passing the string ’train+test' to the split argument.

combined = tfds.load('mnist:3.*.*', split='train+test')

print(len(list(combined)))

We can also use Python style list slicers to specify the data we want. For example, we can specify that we want to take the first 10,000 records of the train split with the string 'train[:10000]', as shown below:

first10k = tfds.load('mnist:3.*.*', split='train[:10000]')

print(len(list(first10k)))

It also allows us to specify the percentage of the data we want to use. For example, we can select the first 20% of the training set with the string 'train[:20%]', as shown below:

first20p = tfds.load('mnist:3.*.*', split='train[:20%]')

print(len(list(first20p)))

We can see that first20p contains 12,000 records, which is indeed 20% the total number of records in the training set. Recall that the training set contains 60,000 records.

Because the slices are string-based we can use loops, like the ones shown below, to slice up the dataset and make some pretty complex splits. For example, the loops below create 10 complimentary validation and training sets (each loop returns a list with 5 data sets).

val_ds = tfds.load('mnist:3.*.*', split=['train[{}%:{}%]'.format(k, k+20) for k in range(0, 100, 20)])

train_ds = tfds.load('mnist:3.*.*', split=['train[:{}%]+train[{}%:]'.format(k, k+20) for k in range(0, 100, 20)])

val_ds

[<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
 <_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
 <_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
 <_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
 <_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>]

train_ds

[<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
 <_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
 <_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
 <_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
 <_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>]

print(len(list(val_ds)))
print(len(list(train_ds)))

5
5

We can also compose new datasets by using pieces from different splits. For example, we can create a new dataset from the first 10% of the test set and the last 80% of the training set, as shown below.

composed_ds = tfds.load('mnist:3.*.*', split='test[:10%]+train[-80%:]')

print(len(list(composed_ds)))