#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
We’ll start by importing TensorFlow and TensorFlow Datasets.
try:
%tensorflow_version 2.x
except:
pass
Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.
import tensorflow as tf
import tensorflow_datasets as tfds
print("\u2022 Using TensorFlow Version:", tf.__version__)
• Using TensorFlow Version: 2.14.0
train_ds, test_ds = tfds.load('mnist:3.*.*', split=['train', 'test'])
print(len(list(train_ds)))
print(len(list(test_ds)))
Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/tensorflow_datasets/mnist/3.0.1...
Dl Completed...: 0%| | 0/5 [00:00<?, ? file/s]
Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.
60000
10000
With the slicing API we can use strings to specify the slicing instructions. For example, in the cell below we will merge the training and test sets by passing the string ’train+test'
to the split
argument.
combined = tfds.load('mnist:3.*.*', split='train+test')
print(len(list(combined)))
70000
We can also use Python style list slicers to specify the data we want. For example, we can specify that we want to take the first 10,000 records of the train
split with the string 'train[:10000]'
, as shown below:
first10k = tfds.load('mnist:3.*.*', split='train[:10000]')
print(len(list(first10k)))
10000
It also allows us to specify the percentage of the data we want to use. For example, we can select the first 20% of the training set with the string 'train[:20%]'
, as shown below:
first20p = tfds.load('mnist:3.*.*', split='train[:20%]')
print(len(list(first20p)))
12000
We can see that first20p
contains 12,000 records, which is indeed 20% the total number of records in the training set. Recall that the training set contains 60,000 records.
Because the slices are string-based we can use loops, like the ones shown below, to slice up the dataset and make some pretty complex splits. For example, the loops below create 10 complimentary validation and training sets (each loop returns a list with 5 data sets).
val_ds = tfds.load('mnist:3.*.*', split=['train[{}%:{}%]'.format(k, k+20) for k in range(0, 100, 20)])
train_ds = tfds.load('mnist:3.*.*', split=['train[:{}%]+train[{}%:]'.format(k, k+20) for k in range(0, 100, 20)])
val_ds
[<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>]
train_ds
[<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>]
print(len(list(val_ds)))
print(len(list(train_ds)))
5
5
We can also compose new datasets by using pieces from different splits. For example, we can create a new dataset from the first 10% of the test set and the last 80% of the training set, as shown below.
composed_ds = tfds.load('mnist:3.*.*', split='test[:10%]+train[-80%:]')
print(len(list(composed_ds)))
49000