Welcome to the final assignment of this course! For this week, you will implement a distribution strategy to train on the Oxford Flowers 102 dataset. As the name suggests, distribution strategies allow you to set up training across multiple devices. We are just using a single device in this lab, but the syntax you’ll apply should also work when you have a multi-device setup. Let’s begin!
# Uncomment the following lines if you're running this notebook on Colab. This is for compatibility with
# the autograder. No need to run these on Coursera.
!pip install tensorflow==2.8.0
!pip install keras==2.8.0
Successfully installed google-auth-oauthlib-0.4.6 keras-2.8.0 keras-preprocessing-1.1.2 tensorboard-2.8.0 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorflow-2.8.0 tf-estimator-nightly-2.8.0.dev2021122109
Requirement already satisfied: keras==2.8.0 in /usr/local/lib/python3.10/dist-packages (2.8.0)
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import tensorflow_hub as hub
# Helper libraries
import numpy as np
import os
from tqdm import tqdm
import tensorflow_datasets as tfds
tfds.disable_progress_bar()
splits = ['train[:80%]', 'train[80%:90%]', 'train[90%:]']
(train_examples, validation_examples, test_examples), info = tfds.load('oxford_flowers102', with_info=True, as_supervised=True, split = splits, data_dir='data/')
num_examples = info.splits['train'].num_examples
num_classes = info.features['label'].num_classes
Downloading and preparing dataset 328.90 MiB (download: 328.90 MiB, generated: 331.34 MiB, total: 660.25 MiB) to data/oxford_flowers102/2.1.1...
Dataset oxford_flowers102 downloaded and prepared to data/oxford_flowers102/2.1.1. Subsequent calls will reuse this data.
How does tf.distribute.MirroredStrategy work?
- All the variables and the model graph are replicated across the replicas.
- Input is evenly distributed across the replicas.
- Each replica calculates the loss and gradients for the input it received.
- The gradients are synced across all the replicas by summing them.
- After the sync, the same update is made to the copies of the variables on each replica.
# If the list of devices is not specified in the
# `tf.distribute.MirroredStrategy` constructor, it will be auto-detected.
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
Number of devices: 1
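By default the strategy grabs every visible device. If you ever want to pin it to a specific set of devices, you can pass them in explicitly. A minimal sketch (the two GPU names below are assumptions for illustration; a single-device runtime like this one may not have them):
# Hypothetical example only: pin the strategy to two specific GPUs.
# The device names are assumptions; this only works on a machine that actually has them.
two_gpu_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
print('Number of devices: {}'.format(two_gpu_strategy.num_replicas_in_sync))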
Set some constants, including the buffer size, number of epochs, and the image size.
BUFFER_SIZE = num_examples
EPOCHS = 10
pixels = 224
# Path to the model features. Only use this when running the notebook on Coursera
# MODULE_HANDLE = 'data/resnet_50_feature_vector'
# Note: Uncomment the line below if you are running the notebook on Colab
MODULE_HANDLE='https://tfhub.dev/tensorflow/resnet_50/feature_vector/1'
IMAGE_SIZE = (pixels, pixels)
print("Using {} with input size {}".format(MODULE_HANDLE, IMAGE_SIZE))
Using https://tfhub.dev/tensorflow/resnet_50/feature_vector/1 with input size (224, 224)
Define a function to format the image (resize the image and scale the pixel values to the range [0, 1]).
def format_image(image, label):
    image = tf.image.resize(image, IMAGE_SIZE) / 255.0
    return image, label
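As a quick, optional sanity check (not part of the graded code), you can run format_image on one example and confirm the output shape and value range:
# Optional sanity check: the formatted image should be (224, 224, 3) with values in [0, 1].
sample_image, sample_label = next(iter(train_examples.map(format_image).take(1)))
print(sample_image.shape)                                                        # (224, 224, 3)
print(float(tf.reduce_min(sample_image)), float(tf.reduce_max(sample_image)))    # both within [0, 1]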
Given the batch size per replica and the strategy, set the global batch size.
Hint: You’ll want to use the num_replicas_in_sync attribute stored in the strategy.
# GRADED FUNCTION
def set_global_batch_size(batch_size_per_replica, strategy):
    '''
    Args:
        batch_size_per_replica (int) - batch size per replica
        strategy (tf.distribute.Strategy) - distribution strategy
    '''
    # set the global batch size
    ### START CODE HERE ###
    global_batch_size = batch_size_per_replica * strategy.num_replicas_in_sync
    ### END CODE HERE ###
    return global_batch_size
Set the GLOBAL_BATCH_SIZE with the function that you just defined
BATCH_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = set_global_batch_size(BATCH_SIZE_PER_REPLICA, strategy)
print(GLOBAL_BATCH_SIZE)
64
Expected Output:
64
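The same function scales with the number of replicas: with a per-replica batch size of 64, a hypothetical 4-GPU MirroredStrategy (where num_replicas_in_sync is 4) would give a global batch size of 64 * 4 = 256. A tiny illustration using a stand-in object:
# Illustration only: a stand-in for a hypothetical 4-replica strategy.
class FakeStrategy:
    num_replicas_in_sync = 4

print(set_global_batch_size(64, FakeStrategy()))  # 256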
Create the batched datasets for training, validation, and testing; these are what you will distribute across the replicas next.
train_batches = train_examples.shuffle(num_examples // 4).map(format_image).batch(BATCH_SIZE_PER_REPLICA).prefetch(1)
validation_batches = validation_examples.map(format_image).batch(BATCH_SIZE_PER_REPLICA).prefetch(1)
test_batches = test_examples.map(format_image).batch(1)
Create the distributed datasets using experimental_distribute_dataset() of the Strategy class, passing in the training, validation, and test batches.
# GRADED FUNCTION
def distribute_datasets(strategy, train_batches, validation_batches, test_batches):
    ### START CODE HERE ###
    train_dist_dataset = strategy.experimental_distribute_dataset(train_batches)
    val_dist_dataset = strategy.experimental_distribute_dataset(validation_batches)
    test_dist_dataset = strategy.experimental_distribute_dataset(test_batches)
    ### END CODE HERE ###
    return train_dist_dataset, val_dist_dataset, test_dist_dataset
Call the function that you just defined to get the distributed datasets.
train_dist_dataset, val_dist_dataset, test_dist_dataset = distribute_datasets(strategy, train_batches, validation_batches, test_batches)
Take a look at the types of the distributed datasets:
print(type(train_dist_dataset))
print(type(val_dist_dataset))
print(type(test_dist_dataset))
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
Expected Output:
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
Also get familiar with a single batch from the train_dist_dataset:
# Take a look at a single batch from the train_dist_dataset
x = iter(train_dist_dataset).get_next()
print(f"x is a tuple that contains {len(x)} values ")
print(f"x[0] contains the features, and has shape {x[0].shape}")
print(f" so it has {x[0].shape[0]} examples in the batch, each is an image that is {x[0].shape[1:]}")
print(f"x[1] contains the labels, and has shape {x[1].shape}")
x is a tuple that contains 2 values
x[0] contains the features, and has shape (64, 224, 224, 3)
so it has 64 examples in the batch, each is an image that is (224, 224, 3)
x[1] contains the labels, and has shape (64,)
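With a single replica, x[0] is an ordinary tensor. Under a multi-device strategy, each element of a distributed dataset is instead a per-replica value, and strategy.experimental_local_results (which comes up again later in this notebook) unpacks it into one tensor per replica. A minimal sketch on the current single-replica setup:
# With one replica this simply returns a tuple of length 1 holding the batch of images.
local_images = strategy.experimental_local_results(x[0])
print(len(local_images))      # 1 (one local replica)
print(local_images[0].shape)  # (64, 224, 224, 3)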
Use the Model Subclassing API to create the model ResNetModel as a subclass of tf.keras.Model.
class ResNetModel(tf.keras.Model):
    def __init__(self, classes):
        super(ResNetModel, self).__init__()
        self._feature_extractor = hub.KerasLayer(MODULE_HANDLE,
                                                 trainable=False)
        self._classifier = tf.keras.layers.Dense(classes, activation='softmax')

    def call(self, inputs):
        x = self._feature_extractor(inputs)
        x = self._classifier(x)
        return x
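To check the wiring before training, you can run a forward pass on a small random batch. This is just an optional sanity check, not part of the graded code (it builds a throwaway instance of the model):
# Optional sanity check: one softmax vector over the 102 classes per input image.
sanity_model = ResNetModel(classes=num_classes)
dummy_images = tf.random.uniform((2, 224, 224, 3))
print(sanity_model(dummy_images).shape)  # (2, 102)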
Create a checkpoint directory to store the checkpoints (the model’s weights during training).
# Create a checkpoint directory to store the checkpoints.
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
You’ll define the loss_object and compute_loss within the strategy.scope().
- loss_object will be used later to calculate the loss on the test set.
- compute_loss will be used later to calculate the average loss on the training data.
You will be using these two loss calculations later.
with strategy.scope():
    # Set reduction to `NONE` so we can do the reduction afterwards and divide by
    # the global batch size.
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)
    # or loss_fn = tf.keras.losses.sparse_categorical_crossentropy
    def compute_loss(labels, predictions):
        per_example_loss = loss_object(labels, predictions)
        return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)

    test_loss = tf.keras.metrics.Mean(name='test_loss')
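Because reduction is set to NONE, loss_object returns one loss value per example; tf.nn.compute_average_loss then sums those values and divides by the global batch size, so the per-replica losses can later be summed across replicas without double counting. A toy illustration on made-up data (a pretend batch of 4 examples, not the real dataset):
# Toy illustration: per-example losses first, then the average scaled by the global batch size.
toy_labels = tf.constant([0, 1, 2, 3])
toy_predictions = tf.random.uniform((4, num_classes))
toy_predictions = toy_predictions / tf.reduce_sum(toy_predictions, axis=1, keepdims=True)
per_example = loss_object(toy_labels, toy_predictions)
print(per_example.shape)  # (4,) -- no reduction applied
print(tf.nn.compute_average_loss(per_example, global_batch_size=4))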
These metrics track the test loss, as well as the training and test accuracy. You can use .result() to get the accumulated statistics at any time, for example, train_accuracy.result().
with strategy.scope():
    train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
        name='train_accuracy')
    test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
        name='test_accuracy')
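As a quick illustration of the metric API (using made-up values and a separate metric object so the real ones stay untouched):
# Toy illustration: accumulate with update_state, read with result, clear with reset_states.
demo_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
demo_accuracy.update_state([1, 0], [[0.1, 0.9], [0.2, 0.8]])  # second prediction is wrong
print(demo_accuracy.result().numpy())  # 0.5
demo_accuracy.reset_states()
print(demo_accuracy.result().numpy())  # 0.0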
This code is given to you. Just remember that the model and optimizer are created within the strategy.scope().
Note: If you are running this on Colab and get the error message OSError: data/resnet_50_feature_vector does not exist, please scroll up to the Setup Input Pipeline section and uncomment the MODULE_HANDLE line for Colab. Then restart the runtime and run all cells.
# model and optimizer must be created under `strategy.scope`.
with strategy.scope():
    model = ResNetModel(classes=num_classes)
    optimizer = tf.keras.optimizers.Adam()
    checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
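The checkpoint object tracks both the model and the optimizer, so you could save the training state at any point, for example at the end of an epoch. The training loop below does not call this for you; it is shown here as an optional sketch:
# Optional: write the current model and optimizer state to ./training_checkpoints/ckpt-*.
# Restore later with: checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
checkpoint.save(checkpoint_prefix)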
You will define a regular training step and test step, which could work without a distributed strategy. You can then use strategy.run to apply these functions in a distributed manner.
- Notice that you’ll define train_step and test_step inside another function train_test_step_fns, which will then return these two functions.

Define the training step:
- Within the strategy’s scope, define train_step(inputs).
  - inputs will be a tuple containing (images, labels).
  - Make a prediction by calling the model with training set to True (complete this part).
  - Use the compute_loss function (defined earlier) to compute the training loss (complete this part).

Define the test step:
- Also within the strategy’s scope, define test_step(inputs).
  - inputs is a tuple containing (images, labels).
  - Make a prediction with training set to False, because the model is not going to train on the test data (complete this part).
  - Use the loss_object, which will compute the test loss. Check compute_loss, defined earlier, to see what parameters to pass into loss_object (complete this part).
  - Update the test_loss (the running test loss) with the t_loss (the loss for the current batch).
  - Update the test_accuracy.
# GRADED FUNCTION
def train_test_step_fns(strategy, model, compute_loss, optimizer, train_accuracy, loss_object, test_loss, test_accuracy):
    with strategy.scope():
        def train_step(inputs):
            images, labels = inputs

            with tf.GradientTape() as tape:
                ### START CODE HERE ###
                predictions = model(images, training=True)
                loss = compute_loss(labels, predictions)
                ### END CODE HERE ###

            gradients = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))

            train_accuracy.update_state(labels, predictions)
            return loss

        def test_step(inputs):
            images, labels = inputs

            ### START CODE HERE ###
            predictions = model(images, training=False)
            t_loss = loss_object(labels, predictions)
            ### END CODE HERE ###

            test_loss.update_state(t_loss)
            test_accuracy.update_state(labels, predictions)

    return train_step, test_step
Use the train_test_step_fns function to produce the train_step and test_step functions.
train_step, test_step = train_test_step_fns(strategy, model, compute_loss, optimizer, train_accuracy, loss_object, test_loss, test_accuracy)
The train_step and test_step could be used in a non-distributed, regular model training. To apply them in a distributed way, you’ll use strategy.run.

distributed_train_step:
- Call the run function of the strategy, passing in the train step function (which you defined earlier), as well as the arguments that go in the train step function.
- The run function is defined like this: run(fn, args=()).
  - args will take in the dataset inputs.

distributed_test_step:
- Similar to distributed_train_step, call the run function of your strategy, taking in the test step function as well as the dataset inputs that go into the test step function.

Hint: you saw earlier that each batch in train_dist_dataset is a tuple with two values: a batch of features and a batch of labels.

Let’s think about how you’ll want to pass in the dataset inputs into args by running this next cell of code:
# See various ways of passing in the inputs
def fun1(args=()):
    print(f"number of arguments passed is {len(args)}")

list_of_inputs = [1, 2]
print("When passing in args=list_of_inputs:")
fun1(args=list_of_inputs)
print()
print("When passing in args=(list_of_inputs)")
fun1(args=(list_of_inputs))
print()
print("When passing in args=(list_of_inputs,)")
fun1(args=(list_of_inputs,))
When passing in args=list_of_inputs:
number of arguments passed is 2
When passing in args=(list_of_inputs)
number of arguments passed is 2
When passing in args=(list_of_inputs,)
number of arguments passed is 1
Notice that how list_of_inputs is passed to args affects whether fun1 sees one or two positional arguments.
- Think about which of these options you’ll want to use when passing the dataset inputs into run.

Please complete the following function.
def distributed_train_test_step_fns(strategy, train_step, test_step, model, compute_loss, optimizer, train_accuracy, loss_object, test_loss, test_accuracy):
    with strategy.scope():
        @tf.function
        def distributed_train_step(dataset_inputs):
            ### START CODE HERE ###
            per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
            ### END CODE HERE ###
            return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                                   axis=None)

        @tf.function
        def distributed_test_step(dataset_inputs):
            ### START CODE HERE ###
            return strategy.run(test_step, args=(dataset_inputs,))
            ### END CODE HERE ###

    return distributed_train_step, distributed_test_step
Call the function that you just defined to get the distributed train step function and distributed test step function.
distributed_train_step, distributed_test_step = distributed_train_test_step_fns(strategy, train_step, test_step, model, compute_loss, optimizer, train_accuracy, loss_object, test_loss, test_accuracy)
An important note before you continue:
The following sections will guide you through training your model and saving it to a .zip file. These sections are not required to pass this assignment, but you are encouraged to continue anyway. If you think no more work is needed in the previous sections, please submit now and then carry on.
After training your model, you can download it as a .zip file and upload it back to the platform to see how well it performed. However, training your model takes around 20 minutes within the Coursera environment. Because of this, there are two methods to train your model:
Method 1
If 20 minutes is too long for you, we recommend downloading this notebook (after submitting it for grading) and uploading it to Colab to finish the training in a GPU-enabled runtime. If you decide to do this, these are the steps to follow:
- Click the jupyter logo on the upper left corner of the window. This will take you to the Jupyter workspace.
- Select this notebook, click Shutdown, and download it.
- In Colab, go to the upload tab and upload your notebook.
- Go to Runtime -> Change Runtime Type and make sure that GPU is enabled.
- Uncomment the MODULE_HANDLE in the Setup input pipeline section that contains the URL to the feature vector.

Method 2
If you prefer to wait the 20 minutes and not leave Coursera, keep going through this notebook. Once you are done, follow these steps:
- Click the jupyter logo on the upper left corner of the window. This will take you to the jupyter filesystem.
- There you will find a file named mymodel.zip. Go ahead and download it.

Independent of the method you choose, you should end up with a mymodel.zip file, which can be uploaded for evaluation after this assignment. Once again, this is optional, but we strongly encourage you to do it as it is a lot of fun.
With this out of the way, let’s continue.
You’ll now use a for-loop to go through the desired number of epochs and train the model in a distributed manner. In each epoch:
- Loop through the distributed training set; for each batch, call distributed_train_step and get the loss.
- Average the accumulated loss over the number of batches to get the training loss, then loop through the distributed test set and call distributed_test_step on each batch.
# Running this cell in Coursera takes around 20 mins
with strategy.scope():
    for epoch in range(EPOCHS):
        # TRAIN LOOP
        total_loss = 0.0
        num_batches = 0
        for x in tqdm(train_dist_dataset):
            total_loss += distributed_train_step(x)
            num_batches += 1
        train_loss = total_loss / num_batches

        # TEST LOOP
        for x in test_dist_dataset:
            distributed_test_step(x)

        template = ("Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, "
                    "Test Accuracy: {}")
        print(template.format(epoch+1, train_loss,
                              train_accuracy.result()*100, test_loss.result(),
                              test_accuracy.result()*100))

        test_loss.reset_states()
        train_accuracy.reset_states()
        test_accuracy.reset_states()
13it [00:26, 2.07s/it]
Epoch 1, Loss: 4.624648094177246, Accuracy: 5.392157077789307, Test Loss: 3.8992528915405273, Test Accuracy: 11.764705657958984
13it [00:02, 4.43it/s]
Epoch 2, Loss: 2.5391149520874023, Accuracy: 54.16666793823242, Test Loss: 2.7988758087158203, Test Accuracy: 43.13725662231445
13it [00:02, 4.44it/s]
Epoch 3, Loss: 1.4021223783493042, Accuracy: 85.17156982421875, Test Loss: 2.218491315841675, Test Accuracy: 51.960784912109375
13it [00:05, 2.51it/s]
Epoch 4, Loss: 0.8167105317115784, Accuracy: 95.22058868408203, Test Loss: 1.8021492958068848, Test Accuracy: 60.78431701660156
13it [00:02, 4.47it/s]
Epoch 5, Loss: 0.5271885991096497, Accuracy: 97.30392456054688, Test Loss: 1.6222103834152222, Test Accuracy: 64.70588684082031
13it [00:02, 4.45it/s]
Epoch 6, Loss: 0.3695581555366516, Accuracy: 98.28431701660156, Test Loss: 1.4777246713638306, Test Accuracy: 65.68627166748047
13it [00:03, 3.49it/s]
Epoch 7, Loss: 0.2741614282131195, Accuracy: 99.01960754394531, Test Loss: 1.4114176034927368, Test Accuracy: 68.62745666503906
13it [00:03, 4.33it/s]
Epoch 8, Loss: 0.21502599120140076, Accuracy: 99.63235473632812, Test Loss: 1.3599189519882202, Test Accuracy: 70.5882339477539
13it [00:03, 4.33it/s]
Epoch 9, Loss: 0.1721489429473877, Accuracy: 99.75489807128906, Test Loss: 1.3026074171066284, Test Accuracy: 71.5686264038086
13it [00:03, 3.62it/s]
Epoch 10, Loss: 0.14168773591518402, Accuracy: 99.75489807128906, Test Loss: 1.2739832401275635, Test Accuracy: 70.5882339477539
Things to note in the example above:
- We are iterating over the train_dist_dataset and test_dist_dataset using a for x in ... construct.
- The scaled loss is the return value of distributed_train_step. This value is aggregated across replicas using the tf.distribute.Strategy.reduce call and then across batches by summing the return values of the tf.distribute.Strategy.reduce calls.
- tf.keras.Metrics should be updated inside train_step and test_step, which get executed by tf.distribute.Strategy.experimental_run_v2.
- tf.distribute.Strategy.experimental_run_v2 returns results from each local replica in the strategy, and there are multiple ways to consume this result. You can do tf.distribute.Strategy.reduce to get an aggregated value. You can also do tf.distribute.Strategy.experimental_local_results to get the list of values contained in the result, one per local replica.

You’ll get a saved model of this trained model. You’ll then need to zip it to upload it to the testing infrastructure. We provide the code to help you with that here:
This code will save your model as a SavedModel
model_save_path = "./tmp/mymodel/1/"
tf.saved_model.save(model, model_save_path)
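If you want to verify the export before zipping it, you can load it back and inspect its serving signatures. This is an optional check and not required by the grader:
# Optional check: reload the exported SavedModel and list its signatures.
reloaded = tf.saved_model.load(model_save_path)
print(list(reloaded.signatures.keys()))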
This code will zip your saved model directory contents into a single file.
If you are on Colab, you can use the file browser pane to the left to find mymodel.zip. Right click on it and select ‘Download’.
If the download fails because you aren’t allowed to download multiple files from colab, check out the guidance here: https://ccm.net/faq/32938-google-chrome-allow-websites-to-perform-simultaneous-downloads
If you are in Coursera, follow the instructions previously provided.
It’s a large file, so it might take some time to download.
import os
import zipfile

def zipdir(path, ziph):
    # ziph is the zipfile handle
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))

zipf = zipfile.ZipFile('./mymodel.zip', 'w', zipfile.ZIP_DEFLATED)
zipdir('./tmp/mymodel/1/', zipf)
zipf.close()
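If you want to double-check what went into the archive before downloading it, you can list its contents:
# Optional: list the files written into mymodel.zip.
with zipfile.ZipFile('./mymodel.zip', 'r') as zf:
    for name in zf.namelist():
        print(name)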