Welcome to the final assignment of this course! For this week, you will implement a distribution strategy to train on the Oxford Flowers 102 dataset. As the name suggests, distribution strategies allow you to set up training across multiple devices. We are just using a single device in this lab, but the syntax you’ll apply should also work when you have a multi-device setup. Let’s begin!
# Uncomment the following lines if you're running this notebook on Colab. This is for compatibility with
# the autograder. No need to run these on Coursera.
!pip install tensorflow==2.8.0
!pip install keras==2.8.0
Successfully installed google-auth-oauthlib-0.4.6 keras-2.8.0 keras-preprocessing-1.1.2 tensorboard-2.8.0 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorflow-2.8.0 tf-estimator-nightly-2.8.0.dev2021122109
Requirement already satisfied: keras==2.8.0 in /usr/local/lib/python3.10/dist-packages (2.8.0)
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import tensorflow_hub as hub
# Helper libraries
import numpy as np
import os
from tqdm import tqdm
import tensorflow_datasets as tfds
tfds.disable_progress_bar()
splits = ['train[:80%]', 'train[80%:90%]', 'train[90%:]']
(train_examples, validation_examples, test_examples), info = tfds.load('oxford_flowers102', with_info=True, as_supervised=True, split = splits, data_dir='data/')
num_examples = info.splits['train'].num_examples
num_classes = info.features['label'].num_classes
Downloading and preparing dataset 328.90 MiB (download: 328.90 MiB, generated: 331.34 MiB, total: 660.25 MiB) to data/oxford_flowers102/2.1.1...
Dataset oxford_flowers102 downloaded and prepared to data/oxford_flowers102/2.1.1. Subsequent calls will reuse this data.
How does tf.distribute.MirroredStrategy work?
- All the variables and the model graph are replicated across the replicas.
- Input is evenly distributed across the replicas.
- Each replica calculates the loss and gradients for the input it received.
- The gradients are synced across all the replicas by summing them.
- After the sync, the same update is made to the copies of the variables on each replica.
# If the list of devices is not specified in the
# `tf.distribute.MirroredStrategy` constructor, it will be auto-detected.
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
Number of devices: 1
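By default the strategy grabs every visible device. If you ever want to pin it to a specific set of devices, you can pass them in explicitly. A minimal sketch (the two GPU names below are assumptions for illustration; a single-device runtime like this one may not have them):
# Hypothetical example only: pin the strategy to two specific GPUs.
# The device names are assumptions; this only works on a machine that actually has them.
two_gpu_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
print('Number of devices: {}'.format(two_gpu_strategy.num_replicas_in_sync))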
Set some constants, including the buffer size, number of epochs, and the image size.
BUFFER_SIZE = num_examples
EPOCHS = 10
pixels = 224
# Path to the model features. Only use this when running the notebook on Coursera
# MODULE_HANDLE = 'data/resnet_50_feature_vector'
# Note: Uncomment the line below if you are running the notebook on Colab
MODULE_HANDLE='https://tfhub.dev/tensorflow/resnet_50/feature_vector/1'
IMAGE_SIZE = (pixels, pixels)
print("Using {} with input size {}".format(MODULE_HANDLE, IMAGE_SIZE))
Using https://tfhub.dev/tensorflow/resnet_50/feature_vector/1 with input size (224, 224)
Define a function to format the image (resize the image and scale the pixel values to the range [0, 1]).
def format_image(image, label):
    image = tf.image.resize(image, IMAGE_SIZE) / 255.0
    return image, label
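As a quick, optional sanity check (not part of the graded code), you can run format_image on one example and confirm the output shape and value range:
# Optional sanity check: the formatted image should be (224, 224, 3) with values in [0, 1].
sample_image, sample_label = next(iter(train_examples.map(format_image).take(1)))
print(sample_image.shape)                                                        # (224, 224, 3)
print(float(tf.reduce_min(sample_image)), float(tf.reduce_max(sample_image)))    # both within [0, 1]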
Given the batch size per replica and the strategy, set the global batch size.
Hint: You’ll want to use the num_replicas_in_sync attribute stored in the strategy.
# GRADED FUNCTION
def set_global_batch_size(batch_size_per_replica, strategy):
    '''
    Args:
        batch_size_per_replica (int) - batch size per replica
        strategy (tf.distribute.Strategy) - distribution strategy
    '''
    # set the global batch size
    ### START CODE HERE ###
    global_batch_size = batch_size_per_replica * strategy.num_replicas_in_sync
    ### END CODE HERE ###
    return global_batch_size
Set the GLOBAL_BATCH_SIZE with the function that you just defined
BATCH_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = set_global_batch_size(BATCH_SIZE_PER_REPLICA, strategy)
print(GLOBAL_BATCH_SIZE)
64
Expected Output:
64
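The same function scales with the number of replicas: with a per-replica batch size of 64, a hypothetical 4-GPU MirroredStrategy (where num_replicas_in_sync is 4) would give a global batch size of 64 * 4 = 256. A tiny illustration using a stand-in object:
# Illustration only: a stand-in for a hypothetical 4-replica strategy.
class FakeStrategy:
    num_replicas_in_sync = 4

print(set_global_batch_size(64, FakeStrategy()))  # 256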
Create the batched datasets for training, validation, and testing; these are what you will distribute across the replicas next.
train_batches = train_examples.shuffle(num_examples // 4).map(format_image).batch(BATCH_SIZE_PER_REPLICA).prefetch(1)
validation_batches = validation_examples.map(format_image).batch(BATCH_SIZE_PER_REPLICA).prefetch(1)
test_batches = test_examples.map(format_image).batch(1)
Create the distributed datasets using experimental_distribute_dataset() of the Strategy class, passing in the training, validation, and test batches.
# GRADED FUNCTION
def distribute_datasets(strategy, train_batches, validation_batches, test_batches):
    ### START CODE HERE ###
    train_dist_dataset = strategy.experimental_distribute_dataset(train_batches)
    val_dist_dataset = strategy.experimental_distribute_dataset(validation_batches)
    test_dist_dataset = strategy.experimental_distribute_dataset(test_batches)
    ### END CODE HERE ###
    return train_dist_dataset, val_dist_dataset, test_dist_dataset
Call the function that you just defined to get the distributed datasets.
train_dist_dataset, val_dist_dataset, test_dist_dataset = distribute_datasets(strategy, train_batches, validation_batches, test_batches)
Take a look at the types of the distributed datasets:
print(type(train_dist_dataset))
print(type(val_dist_dataset))
print(type(test_dist_dataset))
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
Expected Output:
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
Also get familiar with a single batch from the train_dist_dataset:
# Take a look at a single batch from the train_dist_dataset
x = iter(train_dist_dataset).get_next()
print(f"x is a tuple that contains {len(x)} values ")
print(f"x[0] contains the features, and has shape {x[0].shape}")
print(f" so it has {x[0].shape[0]} examples in the batch, each is an image that is {x[0].shape[1:]}")
print(f"x[1] contains the labels, and has shape {x[1].shape}")
x is a tuple that contains 2 values
x[0] contains the features, and has shape (64, 224, 224, 3)
so it has 64 examples in the batch, each is an image that is (224, 224, 3)
x[1] contains the labels, and has shape (64,)
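With a single replica, x[0] is an ordinary tensor. Under a multi-device strategy, each element of a distributed dataset is instead a per-replica value, and strategy.experimental_local_results (which comes up again later in this notebook) unpacks it into one tensor per replica. A minimal sketch on the current single-replica setup:
# With one replica this simply returns a tuple of length 1 holding the batch of images.
local_images = strategy.experimental_local_results(x[0])
print(len(local_images))      # 1 (one local replica)
print(local_images[0].shape)  # (64, 224, 224, 3)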
Use the Model Subclassing API to create the model ResNetModel as a subclass of tf.keras.Model.
class ResNetModel(tf.keras.Model):
    def __init__(self, classes):
        super(ResNetModel, self).__init__()
        self._feature_extractor = hub.KerasLayer(MODULE_HANDLE,
                                                 trainable=False)
        self._classifier = tf.keras.layers.Dense(classes, activation='softmax')

    def call(self, inputs):
        x = self._feature_extractor(inputs)
        x = self._classifier(x)
        return x
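To check the wiring before training, you can run a forward pass on a small random batch. This is just an optional sanity check, not part of the graded code (it builds a throwaway instance of the model):
# Optional sanity check: one softmax vector over the 102 classes per input image.
sanity_model = ResNetModel(classes=num_classes)
dummy_images = tf.random.uniform((2, 224, 224, 3))
print(sanity_model(dummy_images).shape)  # (2, 102)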
Create a checkpoint directory to store the checkpoints (the model’s weights during training).
# Create a checkpoint directory to store the checkpoints.
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
You’ll define the loss_object and compute_loss within the strategy.scope().
- loss_object will be used later to calculate the loss on the test set.
- compute_loss will be used later to calculate the average loss on the training data.
You will be using these two loss calculations later.
with strategy.scope():
    # Set reduction to `NONE` so we can do the reduction afterwards and divide by
    # the global batch size.
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)
    # or loss_fn = tf.keras.losses.sparse_categorical_crossentropy
    def compute_loss(labels, predictions):
        per_example_loss = loss_object(labels, predictions)
        return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)

    test_loss = tf.keras.metrics.Mean(name='test_loss')
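Because reduction is set to NONE, loss_object returns one loss value per example; tf.nn.compute_average_loss then sums those values and divides by the global batch size, so the per-replica losses can later be summed across replicas without double counting. A toy illustration on made-up data (a pretend batch of 4 examples, not the real dataset):
# Toy illustration: per-example losses first, then the average scaled by the global batch size.
toy_labels = tf.constant([0, 1, 2, 3])
toy_predictions = tf.random.uniform((4, num_classes))
toy_predictions = toy_predictions / tf.reduce_sum(toy_predictions, axis=1, keepdims=True)
per_example = loss_object(toy_labels, toy_predictions)
print(per_example.shape)  # (4,) -- no reduction applied
print(tf.nn.compute_average_loss(per_example, global_batch_size=4))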
These metrics track the test loss, as well as the training and test accuracy. You can use .result() to get the accumulated statistics at any time, for example, train_accuracy.result().
with strategy.scope():
    train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
        name='train_accuracy')
    test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
        name='test_accuracy')
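As a quick illustration of the metric API (using made-up values and a separate metric object so the real ones stay untouched):
# Toy illustration: accumulate with update_state, read with result, clear with reset_states.
demo_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
demo_accuracy.update_state([1, 0], [[0.1, 0.9], [0.2, 0.8]])  # second prediction is wrong
print(demo_accuracy.result().numpy())  # 0.5
demo_accuracy.reset_states()
print(demo_accuracy.result().numpy())  # 0.0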
This code is given to you. Just remember that the model and optimizer are created within the strategy.scope().
Note: If you are running this on Colab and get the error message OSError: data/resnet_50_feature_vector does not exist, please scroll up to the Setup Input Pipeline section and uncomment the MODULE_HANDLE line for Colab. Then restart the runtime and run all cells.
# model and optimizer must be created under `strategy.scope`.
with strategy.scope():
    model = ResNetModel(classes=num_classes)
    optimizer = tf.keras.optimizers.Adam()
    checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
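The checkpoint object tracks both the model and the optimizer, so you could save the training state at any point, for example at the end of an epoch. The training loop below does not call this for you; it is shown here as an optional sketch:
# Optional: write the current model and optimizer state to ./training_checkpoints/ckpt-*.
# Restore later with: checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
checkpoint.save(checkpoint_prefix)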
You will define a regular training step and test step, which could work without a distributed strategy. You can then use strategy.run to apply these functions in a distributed manner.
- Notice that you’ll define train_step and test_step inside another function train_test_step_fns, which will then return these two functions.

Define the training step:
- Within the strategy’s scope, define train_step(inputs).
  - inputs will be a tuple containing (images, labels).
  - Make a prediction by calling the model with training set to True (complete this part).
  - Use the compute_loss function (defined earlier) to compute the training loss (complete this part).

Define the test step:
- Also within the strategy’s scope, define test_step(inputs).
  - inputs is a tuple containing (images, labels).
  - Make a prediction with training set to False, because the model is not going to train on the test data (complete this part).
  - Use the loss_object, which will compute the test loss. Check compute_loss, defined earlier, to see what parameters to pass into loss_object (complete this part).
  - Update the test_loss (the running test loss) with the t_loss (the loss for the current batch).
  - Update the test_accuracy.
# GRADED FUNCTION
def train_test_step_fns(strategy, model, compute_loss, optimizer, train_accuracy, loss_object, test_loss, test_accuracy):
    with strategy.scope():
        def train_step(inputs):
            images, labels = inputs

            with tf.GradientTape() as tape:
                ### START CODE HERE ###
                predictions = model(images, training=True)
                loss = compute_loss(labels, predictions)
                ### END CODE HERE ###

            gradients = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))

            train_accuracy.update_state(labels, predictions)
            return loss

        def test_step(inputs):
            images, labels = inputs

            ### START CODE HERE ###
            predictions = model(images, training=False)
            t_loss = loss_object(labels, predictions)
            ### END CODE HERE ###

            test_loss.update_state(t_loss)
            test_accuracy.update_state(labels, predictions)

    return train_step, test_step
Use the train_test_step_fns function to produce the train_step and test_step functions.
train_step, test_step = train_test_step_fns(strategy, model, compute_loss, optimizer, train_accuracy, loss_object, test_loss, test_accuracy)
The train_step and test_step could be used in a non-distributed, regular model training. To apply them in a distributed way, you’ll use strategy.run.

distributed_train_step:
- Call the run function of the strategy, passing in the train step function (which you defined earlier), as well as the arguments that go in the train step function.
- The run function is defined like this: run(fn, args=()).
  - args will take in the dataset inputs.

distributed_test_step:
- Similar to distributed_train_step, call the run function of your strategy, taking in the test step function as well as the dataset inputs that go into the test step function.

Hint: you saw earlier that each batch in train_dist_dataset is a tuple with two values: a batch of features and a batch of labels.

Let’s think about how you’ll want to pass in the dataset inputs into args by running this next cell of code:
# See various ways of passing in the inputs
def fun1(args=()):
    print(f"number of arguments passed is {len(args)}")

list_of_inputs = [1, 2]
print("When passing in args=list_of_inputs:")
fun1(args=list_of_inputs)
print()
print("When passing in args=(list_of_inputs)")
fun1(args=(list_of_inputs))
print()
print("When passing in args=(list_of_inputs,)")
fun1(args=(list_of_inputs,))
When passing in args=list_of_inputs:
number of arguments passed is 2
When passing in args=(list_of_inputs)
number of arguments passed is 2
When passing in args=(list_of_inputs,)
number of arguments passed is 1
Notice that how list_of_inputs is passed to args affects whether fun1 sees one or two positional arguments.
- Think about which of these options you’ll want to use when passing the dataset inputs into run.

Please complete the following function.
def distributed_train_test_step_fns(strategy, train_step, test_step, model, compute_loss, optimizer, train_accuracy, loss_object, test_loss, test_accuracy):
    with strategy.scope():
        @tf.function
        def distributed_train_step(dataset_inputs):
            ### START CODE HERE ###
            per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
            ### END CODE HERE ###
            return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                                   axis=None)

        @tf.function
        def distributed_test_step(dataset_inputs):
            ### START CODE HERE ###
            return strategy.run(test_step, args=(dataset_inputs,))
            ### END CODE HERE ###

    return distributed_train_step, distributed_test_step
Call the function that you just defined to get the distributed train step function and distributed test step function.
distributed_train_step, distributed_test_step = distributed_train_test_step_fns(strategy, train_step, test_step, model, compute_loss, optimizer, train_accuracy, loss_object, test_loss, test_accuracy)
An important note before you continue:
The following sections will guide you through training your model and saving it to a .zip file. These sections are not required to pass this assignment, but you are encouraged to continue anyway. If you think no more work is needed in the previous sections, please submit now and then carry on.
After training your model, you can download it as a .zip file and upload it back to the platform to see how well it performed. However, training your model takes around 20 minutes within the Coursera environment. Because of this, there are two methods to train your model:
Method 1
If 20 minutes is too long for you, we recommend downloading this notebook (after submitting it for grading) and uploading it to Colab to finish the training in a GPU-enabled runtime. If you decide to do this, these are the steps to follow:
- Click the jupyter logo on the upper left corner of the window. This will take you to the Jupyter workspace.
- Select this notebook, click Shutdown, and download it.
- In Colab, go to the upload tab and upload your notebook.
- Go to Runtime -> Change Runtime Type and make sure that GPU is enabled.
- Uncomment the MODULE_HANDLE in the Setup input pipeline section that contains the URL to the feature vector.

Method 2
If you prefer to wait the 20 minutes and not leave Coursera, keep going through this notebook. Once you are done, follow these steps:
- Click the jupyter logo on the upper left corner of the window. This will take you to the jupyter filesystem.
- There you will find a file named mymodel.zip. Go ahead and download it.

Independent of the method you choose, you should end up with a mymodel.zip file, which can be uploaded for evaluation after this assignment. Once again, this is optional, but we strongly encourage you to do it as it is a lot of fun.
With this out of the way, let’s continue.
You’ll now use a for-loop to go through the desired number of epochs and train the model in a distributed manner. In each epoch:
- Loop through the distributed training set; for each batch, call distributed_train_step and get the loss.
- Average the accumulated loss over the number of batches to get the training loss, then loop through the distributed test set and call distributed_test_step on each batch.
# Running this cell in Coursera takes around 20 mins
with strategy.scope():
    for epoch in range(EPOCHS):
        # TRAIN LOOP
        total_loss = 0.0
        num_batches = 0
        for x in tqdm(train_dist_dataset):
            total_loss += distributed_train_step(x)
            num_batches += 1
        train_loss = total_loss / num_batches

        # TEST LOOP
        for x in test_dist_dataset:
            distributed_test_step(x)

        template = ("Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, "
                    "Test Accuracy: {}")
        print(template.format(epoch+1, train_loss,
                              train_accuracy.result()*100, test_loss.result(),
                              test_accuracy.result()*100))

        test_loss.reset_states()
        train_accuracy.reset_states()
        test_accuracy.reset_states()
13it [00:26, 2.07s/it]
Epoch 1, Loss: 4.624648094177246, Accuracy: 5.392157077789307, Test Loss: 3.8992528915405273, Test Accuracy: 11.764705657958984
13it [00:02, 4.43it/s]
Epoch 2, Loss: 2.5391149520874023, Accuracy: 54.16666793823242, Test Loss: 2.7988758087158203, Test Accuracy: 43.13725662231445
13it [00:02, 4.44it/s]
Epoch 3, Loss: 1.4021223783493042, Accuracy: 85.17156982421875, Test Loss: 2.218491315841675, Test Accuracy: 51.960784912109375
13it [00:05, 2.51it/s]
Epoch 4, Loss: 0.8167105317115784, Accuracy: 95.22058868408203, Test Loss: 1.8021492958068848, Test Accuracy: 60.78431701660156
13it [00:02, 4.47it/s]
Epoch 5, Loss: 0.5271885991096497, Accuracy: 97.30392456054688, Test Loss: 1.6222103834152222, Test Accuracy: 64.70588684082031
13it [00:02, 4.45it/s]
Epoch 6, Loss: 0.3695581555366516, Accuracy: 98.28431701660156, Test Loss: 1.4777246713638306, Test Accuracy: 65.68627166748047
13it [00:03, 3.49it/s]
Epoch 7, Loss: 0.2741614282131195, Accuracy: 99.01960754394531, Test Loss: 1.4114176034927368, Test Accuracy: 68.62745666503906
13it [00:03, 4.33it/s]
Epoch 8, Loss: 0.21502599120140076, Accuracy: 99.63235473632812, Test Loss: 1.3599189519882202, Test Accuracy: 70.5882339477539
13it [00:03, 4.33it/s]
Epoch 9, Loss: 0.1721489429473877, Accuracy: 99.75489807128906, Test Loss: 1.3026074171066284, Test Accuracy: 71.5686264038086
13it [00:03, 3.62it/s]
Epoch 10, Loss: 0.14168773591518402, Accuracy: 99.75489807128906, Test Loss: 1.2739832401275635, Test Accuracy: 70.5882339477539
Things to note in the example above:
- We are iterating over the train_dist_dataset and test_dist_dataset using a for x in ... construct.
- The scaled loss is the return value of distributed_train_step. This value is aggregated across replicas using the tf.distribute.Strategy.reduce call and then across batches by summing the return values of the tf.distribute.Strategy.reduce calls.
- tf.keras.Metrics should be updated inside train_step and test_step, which get executed by tf.distribute.Strategy.experimental_run_v2.
- tf.distribute.Strategy.experimental_run_v2 returns results from each local replica in the strategy, and there are multiple ways to consume this result. You can do tf.distribute.Strategy.reduce to get an aggregated value. You can also do tf.distribute.Strategy.experimental_local_results to get the list of values contained in the result, one per local replica.

You’ll get a saved model of this trained model. You’ll then need to zip it to upload it to the testing infrastructure. We provide the code to help you with that here:
This code will save your model as a SavedModel
model_save_path = "./tmp/mymodel/1/"
tf.saved_model.save(model, model_save_path)
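If you want to verify the export before zipping it, you can load it back and inspect its serving signatures. This is an optional check and not required by the grader:
# Optional check: reload the exported SavedModel and list its signatures.
reloaded = tf.saved_model.load(model_save_path)
print(list(reloaded.signatures.keys()))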
This code will zip your saved model directory contents into a single file.
If you are on Colab, you can use the file browser pane to the left to find mymodel.zip. Right click on it and select ‘Download’.
If the download fails because you aren’t allowed to download multiple files from colab, check out the guidance here: https://ccm.net/faq/32938-google-chrome-allow-websites-to-perform-simultaneous-downloads
If you are in Coursera, follow the instructions previously provided.
It’s a large file, so it might take some time to download.
import os
import zipfile

def zipdir(path, ziph):
    # ziph is the zipfile handle
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))

zipf = zipfile.ZipFile('./mymodel.zip', 'w', zipfile.ZIP_DEFLATED)
zipdir('./tmp/mymodel/1/', zipf)
zipf.close()
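If you want to double-check what went into the archive before downloading it, you can list its contents:
# Optional: list the files written into mymodel.zip.
with zipfile.ZipFile('./mymodel.zip', 'r') as zf:
    for name in zf.namelist():
        print(name)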