This tutorial demonstrates how to train and deploy an image classification model on Vertex AI.
You will train a Convolutional Neural Network (CNN) to classify images of clothing from a dataset called Fashion MNIST.
This dataset contains 70,000 images of clothing belonging to 10 different categories. The images show individual articles of clothing at low resolution (28 by 28 pixels).
In this notebook, you will use 60,000 images to train the network and 10,000 images to evaluate how accurately the network learned to classify the images.
The Fashion MNIST data is available in TensorFlow Datasets (tfds).
In this notebook, you will create a custom-trained CNN model from a Python script in a Docker container using the Vertex SDK for Python, and then make a prediction on the deployed model by sending it data. Alternatively, you can create custom-trained models using the gcloud command-line tool, or online using the Cloud Console.
The steps performed include:
- Training a custom CNN model from a Python script in a Docker container.
- Deploying the trained Model resource to a serving Endpoint resource.
- Making predictions by sending data to the Endpoint resource.
- Undeploying the Model resource.
!pip install google-cloud-aiplatform
!pip install --user tensorflow-text
!pip install --user tensorflow-datasets
!pip install protobuf==3.20.1
Requirement already satisfied: google-cloud-aiplatform in /opt/conda/lib/python3.10/site-packages (1.43.0)
...
Collecting tensorflow-text
...
Successfully installed astunparse-1.6.3 flatbuffers-24.3.7 gast-0.5.4 google-pasta-0.2.0 h5py-3.10.0 keras-3.1.1 libclang-18.1.1 markdown-3.6 ml-dtypes-0.3.2 namex-0.0.7 opt-einsum-3.3.0 optree-0.10.0 tensorboard-2.16.2 tensorboard-data-server-0.7.2 tensorflow-2.16.1 tensorflow-io-gcs-filesystem-0.36.0 tensorflow-text-2.16.1 termcolor-2.4.0 werkzeug-3.0.1
Collecting tensorflow-datasets
...
Successfully installed absl-py-1.4.0 array-record-0.5.0 etils-1.7.0 promise-2.3 tensorflow-datasets-4.9.4 tensorflow-metadata-1.14.0
Collecting protobuf==3.20.1
...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.16.1 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3, but you have protobuf 3.20.1 which is incompatible.
google-cloud-aiplatform 1.43.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
...
Successfully installed protobuf-3.20.1
Once you’ve installed everything, you need to restart the notebook kernel so it can find the packages.
import os
if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
If you don't know your project ID, you may be able to get your project ID using the gcloud command-line tool.
import os
PROJECT_ID = "qwiklabs-gcp-01-cb944d7414a1"
if not os.getenv("IS_TESTING"):
    # Get your Google Cloud project ID from gcloud
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)
Project ID: qwiklabs-gcp-01-cb944d7414a1
If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you create a timestamp for each session and append it to the names of the resources you create in this tutorial.
from datetime import datetime
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
The following steps are required, regardless of your notebook environment.
When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. Vertex AI runs the code from this package. In this tutorial, Vertex AI also saves the trained model that results from your job in the same bucket. Using this model artifact, you can then create Vertex AI model and endpoint resources in order to serve online predictions.
Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.
You may also change the REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Vertex AI services are available. You may not use a Multi-Regional Storage bucket for training with Vertex AI. (Replace REGION with the region listed in the Qwiklabs resource panel.)
BUCKET_NAME = "gs://qwiklabs-gcp-01-cb944d7414a1-bucket"
REGION = "us-central1" # @param {type:"string"}
PROJECT_ID
'qwiklabs-gcp-01-cb944d7414a1'
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "gs://[your-bucket-name]":
    BUCKET_NAME = "gs://" + PROJECT_ID
Only if your bucket doesn’t already exist: Run the following cells to create your Cloud Storage bucket.
BUCKET_NAME
# ! gsutil mb -l $REGION $BUCKET_NAME
Go back to the lab manual and click Check my progress under the Create a Cloud Storage bucket section to verify that the bucket was created.
Next, set up some variables used throughout the tutorial.
Import the Vertex SDK for Python into your Python environment and initialize it.
import os
import sys
from google.cloud import aiplatform
from google.cloud.aiplatform import gapic as aip
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)
Here, to run the container image on a CPU, we set the variables TRAIN_GPU/TRAIN_NGPU and DEPLOY_GPU/DEPLOY_NGPU to (None, None), since this notebook is meant to be run in a Qwiklabs environment where GPUs cannot be provisioned.
Note: If you happen to be running this notebook from your personal GCP account, set the variables TRAIN_GPU/TRAIN_NGPU
and DEPLOY_GPU/DEPLOY_NGPU
to use a container image supporting a GPU and the number of GPUs allocated to the virtual machine (VM) instance. For example, to use a GPU container image with 4 Nvidia Tesla K80 GPUs allocated to each VM, you would specify:
(aip.AcceleratorType.NVIDIA_TESLA_K80, 4)
See the locations where accelerators are available.
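For illustration only, such an assignment could look like the following sketch (commented out because GPUs cannot be provisioned in this lab, and the accelerator type and counts shown are just examples):
# Illustrative only: requires GPU quota in the chosen region; not used in this lab.
# TRAIN_GPU, TRAIN_NGPU = (aip.AcceleratorType.NVIDIA_TESLA_K80, 4)
# DEPLOY_GPU, DEPLOY_NGPU = (aip.AcceleratorType.NVIDIA_TESLA_K80, 1)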
TRAIN_GPU, TRAIN_NGPU = (None, None)
DEPLOY_GPU, DEPLOY_NGPU = (None, None)
There are two ways you can train a custom model using a container image:
Use a Google Cloud prebuilt container. If you use a prebuilt container, you will additionally specify a Python package to install into the container image. This Python package contains your code for training a custom model.
Use your own custom container image. If you use your own container, the container needs to contain your code for training a custom model.
Here you will use pre-built containers provided by Vertex AI to run training and prediction.
For the latest list, see Pre-built containers for training and Pre-built containers for prediction.
TRAIN_VERSION = "tf-cpu.2-8"
DEPLOY_VERSION = "tf2-cpu.2-8"
TRAIN_IMAGE = "us-docker.pkg.dev/vertex-ai/training/{}:latest".format(TRAIN_VERSION)
DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(DEPLOY_VERSION)
print("Training:", TRAIN_IMAGE, TRAIN_GPU, TRAIN_NGPU)
print("Deployment:", DEPLOY_IMAGE, DEPLOY_GPU, DEPLOY_NGPU)
Training: us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-8:latest None None
Deployment: us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest None None
Next, set the machine types to use for training and prediction.
Set the variables TRAIN_COMPUTE and DEPLOY_COMPUTE to configure your compute resources for training and prediction.
- machine type
  - n1-standard: 3.75 GB of memory per vCPU
  - n1-highmem: 6.5 GB of memory per vCPU
  - n1-highcpu: 0.9 GB of memory per vCPU
- vCPUs: number of [2, 4, 8, 16, 32, 64, 96]
Note: The following is not supported for training:
- standard: 2 vCPUs
- highcpu: 2, 4 and 8 vCPUs
Note: You may also use n2 and e2 machine types for training and deployment, but they do not support GPUs.
MACHINE_TYPE = "n1-standard"
VCPU = "4"
TRAIN_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Train machine type", TRAIN_COMPUTE)
MACHINE_TYPE = "n1-standard"
VCPU = "4"
DEPLOY_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Deploy machine type", DEPLOY_COMPUTE)
Train machine type n1-standard-4
Deploy machine type n1-standard-4
Now you are ready to design and train your own custom CNN model with the Fashion MNIST data.
In the Introduction to Computer Vision lab, you saw how to train a Deep Neural Network (DNN) that can classify images of clothing. You can improve on the accuracy your DNN achieved by using convolutions. Convolutions narrow down the content of the image to focus on specific, distinct details. The Introduction to Convolutions lab explores convolutions in more detail; please go through it to gain an in-depth understanding of convolutions.
A Convolutional Neural Network (CNN) is a neural network in which at least one layer is a convolutional layer. A typical CNN consists of some combination of convolutional layers, pooling layers, and dense (fully connected) layers.
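As a quick, optional illustration of how these layers interact, here is a standalone sketch that assumes TensorFlow is available in your notebook kernel and simply mirrors the layer stack used in the training script below:
import tensorflow as tf

# Each 3x3 convolution trims the borders and each 2x2 max-pool halves the size:
# 28x28 -> Conv2D(3x3) -> 26x26 -> MaxPool(2x2) -> 13x13
#       -> Conv2D(3x3) -> 11x11 -> MaxPool(2x2) -> 5x5
sketch = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D(2, 2),
])
sketch.summary()  # prints the shrinking feature-map shapes listed above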
In the next cell, you will write the contents of the training script, task.py. To summarize, the training script:
- Gets the directory in which to save the model artifacts from the environment variable AIP_MODEL_DIR. This variable is set by the training service.
- Loads the Fashion MNIST dataset using tfds.
- Builds a CNN using the tf.keras model API. (Please see the code comments for details about the layers in the CNN.)
- Compiles the model (compile()).
- Trains the model (fit()) for the number of epochs specified by the argument args.epochs.
- Saves the trained model (save(MODEL_DIR)) to the specified model directory.
%%writefile task.py
# Training Fashion MNIST using CNN
import tensorflow_datasets as tfds
import tensorflow as tf
from tensorflow.python.client import device_lib
import argparse
import os
import sys
tfds.disable_progress_bar()
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', dest='epochs',
default=10, type=int,
help='Number of epochs.')
args = parser.parse_args()
print('Python Version = {}'.format(sys.version))
print('TensorFlow Version = {}'.format(tf.__version__))
print('TF_CONFIG = {}'.format(os.environ.get('TF_CONFIG', 'Not found')))
print('DEVICES', device_lib.list_local_devices())
# Define batch size
BATCH_SIZE = 32
# Load the dataset
datasets, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True)
# Normalize and batch process the dataset
ds_train = datasets['train'].map(lambda x, y: (tf.cast(x, tf.float32)/255.0, y)).batch(BATCH_SIZE)
# Build the Convolutional Neural Network
# As input, a CNN takes tensors of shape (image_height, image_width, color_channels), ignoring the batch size.
# Here, the CNN is configured to process inputs of shape (28, 28, 1), which is the format of Fashion MNIST images
# You start with a stack of Conv2D and MaxPooling2D layers.
# Conv2D layer https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D
# The number of output channels for each Conv2D layer is controlled by the first argument (e.g., 32 or 64).
# The next argument specifies the size of the Convolution, in this case, a 3x3 grid.
# `relu` is used as the activation function, which you might recall is the equivalent of returning x when x>0, else returning 0.
# MaxPooling2D layer https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool2D
# By specifying (2,2) for the MaxPooling, the effect is to quarter the size of the image.
# Dense layers
# The output from the convolution stack is flattened and passed to a set of dense layers for classification.
model = tf.keras.models.Sequential([
tf.keras.layers.Conv2D(64, (3,3), activation=tf.nn.relu, input_shape=(28, 28, 1)),
tf.keras.layers.MaxPooling2D(2, 2),
tf.keras.layers.Conv2D(64, (3,3), activation=tf.nn.relu),
tf.keras.layers.MaxPooling2D(2,2),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation=tf.nn.relu),
tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer = tf.keras.optimizers.Adam(),
loss = tf.keras.losses.SparseCategoricalCrossentropy(),
metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
# Train and save the model
MODEL_DIR = os.getenv("AIP_MODEL_DIR")
model.fit(ds_train, epochs=args.epochs)
model.save(MODEL_DIR)
Writing task.py
Prepare the command-line arguments to pass to your training script.
- args: The command line arguments to pass to the corresponding Python module. In this example, they will be:
  - "--epochs=" + EPOCHS: The number of epochs for training.
JOB_NAME = "custom_job_" + TIMESTAMP
MODEL_DIR = "{}/{}".format(BUCKET_NAME, JOB_NAME)
EPOCHS = 5
CMDARGS = [
"--epochs=" + str(EPOCHS),
]
Define your custom training job on Vertex AI.
Use the CustomTrainingJob class to define the job, which takes the following parameters:
- display_name: The user-defined name of this training pipeline.
- script_path: The local path to the training script.
- container_uri: The URI of the training container image.
- requirements: The list of Python package dependencies of the script.
- model_serving_container_image_uri: The URI of a container that can serve predictions for your model, either a prebuilt container or a custom container.
Use the run function to start training, which takes the following parameters:
- args: The command line arguments to be passed to the Python script.
- replica_count: The number of worker replicas.
- model_display_name: The display name of the Model if the script produces a managed Model.
- machine_type: The type of machine to use for training.
- accelerator_type: The hardware accelerator type.
- accelerator_count: The number of accelerators to attach to a worker replica.
The run function creates a training pipeline that trains and creates a Model object. After the training pipeline completes, the run function returns the Model object.
You can read more about the CustomTrainingJob.run API here.
job = aiplatform.CustomTrainingJob(
display_name=JOB_NAME,
script_path="task.py",
container_uri=TRAIN_IMAGE,
requirements=["tensorflow_datasets==4.6.0"],
model_serving_container_image_uri=DEPLOY_IMAGE,
)
MODEL_DISPLAY_NAME = "fashionmnist-" + TIMESTAMP
# Start the training
if TRAIN_GPU:
    model = job.run(
        model_display_name=MODEL_DISPLAY_NAME,
        args=CMDARGS,
        replica_count=1,
        machine_type=TRAIN_COMPUTE,
        accelerator_type=TRAIN_GPU.name,
        accelerator_count=TRAIN_NGPU,
    )
else:
    model = job.run(
        model_display_name=MODEL_DISPLAY_NAME,
        args=CMDARGS,
        replica_count=1,
        machine_type=TRAIN_COMPUTE,
        accelerator_count=0,
    )
Training script copied to:
gs://qwiklabs-gcp-01-cb944d7414a1-bucket/aiplatform-2024-03-22-16:40:15.235-aiplatform_custom_trainer_script-0.1.tar.gz.
Training Output directory:
gs://qwiklabs-gcp-01-cb944d7414a1-bucket/aiplatform-custom-training-2024-03-22-16:40:15.392
View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/2899049254888669184?project=45654375613
CustomTrainingJob projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184 current state:
PipelineState.PIPELINE_STATE_PENDING
CustomTrainingJob projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184 current state:
PipelineState.PIPELINE_STATE_PENDING
CustomTrainingJob projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184 current state:
PipelineState.PIPELINE_STATE_PENDING
CustomTrainingJob projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184 current state:
PipelineState.PIPELINE_STATE_PENDING
View backing custom job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/5303883494974291968?project=45654375613
CustomTrainingJob projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomTrainingJob projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomTrainingJob run completed. Resource name: projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184
Model available at projects/45654375613/locations/us-central1/models/5278008756557316096
The training pipeline will take an average of 10 to 12 minutes to finish.
To view the training pipeline status, navigate to Vertex AI ➞ Training and select the region listed in the Qwiklabs resource panel.
You can see the status of the current training pipeline there.
Once the model has been trained successfully, you can see the custom-trained model under Vertex AI ➞ Model Registry.
Go back to the lab manual and click Check my progress under the Train the Model on Vertex AI section to verify that the pipeline ran successfully.
Before you use your model to make predictions, you need to deploy it to an Endpoint. You can do this by calling the deploy function on the Model resource. This will do two things:
- Create an Endpoint resource for deploying the Model resource to.
- Deploy the Model resource to the Endpoint resource.
The function takes the following parameters:
- deployed_model_display_name: A human readable name for the deployed model.
- traffic_split: Percent of traffic at the endpoint that goes to this model, which is specified as a dictionary of one or more key/value pairs. If there are existing models deployed to the endpoint among which the traffic is split, use model_id to specify it as { "0": percent, model_id: percent, ... }, where model_id is the model ID of an existing model deployed to the endpoint. The percents must add up to 100.
- machine_type: The type of machine to use for serving.
- accelerator_type: The hardware accelerator type.
- accelerator_count: The number of accelerators to attach to a worker replica.
- starting_replica_count: The number of compute instances to initially provision.
- max_replica_count: The maximum number of compute instances to scale to. In this tutorial, only one instance is provisioned.
The traffic_split parameter is specified as a Python dictionary. You can deploy more than one instance of your model to an endpoint, and then set the percentage of traffic that goes to each instance.
You can use a traffic split to introduce a new model gradually into production. For example, if you had one existing model in production with 100% of the traffic, you could deploy a new model to the same endpoint, direct 10% of traffic to it, and reduce the original model's traffic to 90%. This allows you to monitor the new model's performance while minimizing the disruption to the majority of users.
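As a purely illustrative sketch of that canary pattern (the model ID below is a made-up placeholder, not a resource created in this lab), such a dictionary could look like this:
# Hypothetical canary split: "0" refers to the model being deployed in this call,
# the other key is the ID of a model already deployed to the endpoint.
EXISTING_MODEL_ID = "1234567890123456789"  # placeholder ID
CANARY_TRAFFIC_SPLIT = {"0": 10, EXISTING_MODEL_ID: 90}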
You can specify a single instance (or node) to serve your online prediction requests. This tutorial uses a single node, so the variables MIN_NODES and MAX_NODES are both set to 1.
If you want to use multiple nodes to serve your online prediction requests, set MAX_NODES
to the maximum number of nodes you want to use. Vertex AI autoscales the number of nodes used to serve your predictions, up to the maximum number you set. Refer to the pricing page to understand the costs of autoscaling with multiple nodes.
The method will block until the model is deployed and eventually return an Endpoint
object. If this is the first time a model is deployed to the endpoint, it may take a few additional minutes to complete provisioning of resources.
DEPLOYED_NAME = "fashionmnist_deployed-" + TIMESTAMP
TRAFFIC_SPLIT = {"0": 100}
MIN_NODES = 1
MAX_NODES = 1
if DEPLOY_GPU:
    endpoint = model.deploy(
        deployed_model_display_name=DEPLOYED_NAME,
        traffic_split=TRAFFIC_SPLIT,
        machine_type=DEPLOY_COMPUTE,
        accelerator_type=DEPLOY_GPU.name,
        accelerator_count=DEPLOY_NGPU,
        min_replica_count=MIN_NODES,
        max_replica_count=MAX_NODES,
    )
else:
    endpoint = model.deploy(
        deployed_model_display_name=DEPLOYED_NAME,
        traffic_split=TRAFFIC_SPLIT,
        machine_type=DEPLOY_COMPUTE,
        accelerator_type=None,
        accelerator_count=0,
        min_replica_count=MIN_NODES,
        max_replica_count=MAX_NODES,
    )
Creating Endpoint
Create Endpoint backing LRO: projects/45654375613/locations/us-central1/endpoints/4106150363185283072/operations/1516086931086114816
Endpoint created. Resource name: projects/45654375613/locations/us-central1/endpoints/4106150363185283072
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/45654375613/locations/us-central1/endpoints/4106150363185283072')
Deploying model to Endpoint : projects/45654375613/locations/us-central1/endpoints/4106150363185283072
Deploy Endpoint model backing LRO: projects/45654375613/locations/us-central1/endpoints/4106150363185283072/operations/543309411574087680
Endpoint model deployed. Resource name: projects/45654375613/locations/us-central1/endpoints/4106150363185283072
The deployment process will take an average of 15 to 17 minutes to finish.
To view your deployed endpoint, head over to Vertex AI ➞ Online prediction.
You can check whether your endpoint is in the list of currently deployed/deploying endpoints.
To view the details of the endpoint that is currently being deployed, simply click on the endpoint name.
Once deployment is successful, the endpoint status changes to Ready.
Go back to the lab manual and click Check my progress under the Deploy the Model section to verify that the deployment was a success.
You will now see how an online prediction request can be sent to your deployed model.
First, you will load the test split of the Fashion MNIST dataset using tfds.
Set the batch size to -1 to load the entire dataset.
import tensorflow_datasets as tfds
import numpy as np
tfds.disable_progress_bar()
datasets, info = tfds.load('fashion_mnist', batch_size=-1, with_info=True, as_supervised=True)
test_dataset = datasets['test']
2024-03-22 17:08:30.846380: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-22 17:08:34.906193: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-22 17:08:39.251268: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-22 17:08:50.656143: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading and preparing dataset 29.45 MiB (download: 29.45 MiB, generated: 36.42 MiB, total: 65.87 MiB) to /home/jupyter/tensorflow_datasets/fashion_mnist/3.0.1...
Dataset fashion_mnist downloaded and prepared to /home/jupyter/tensorflow_datasets/fashion_mnist/3.0.1. Subsequent calls will reuse this data.
Now you will convert the dataset to NumPy arrays (images, labels). The images used for training the model were normalized. Hence the trained model expects normalized images for prediction. You will normalize the test images by dividing each pixel by 255.
x_test, y_test = tfds.as_numpy(test_dataset)
# Normalize (rescale) the pixel data by dividing each pixel by 255.
x_test = x_test.astype('float32') / 255.
Ensure the shapes are correct.
x_test.shape, y_test.shape
((10000, 28, 28, 1), (10000,))
#@title Pick the number of test images
NUM_TEST_IMAGES = 20 #@param {type:"slider", min:1, max:20, step:1}
x_test, y_test = x_test[:NUM_TEST_IMAGES], y_test[:NUM_TEST_IMAGES]
Now that you have prepared the test images, you can use them to send a prediction request. Use the Endpoint object's predict function, which takes the following parameters:
- instances: A list of image instances. According to your custom model, each image instance should be a 3-dimensional matrix of floats. This was prepared in the previous step.
The predict function returns a list, where each element in the list corresponds to the corresponding image in the request. The output for each prediction will contain:
- The confidence level for the prediction (predictions), between 0 and 1, for each of the ten classes.
Now you can run a quick evaluation on the prediction results:
- np.argmax: Convert each list of confidence levels to a label.
- Compute accuracy as no. of correct predictions / total no. of predictions.
predictions = endpoint.predict(instances=x_test.tolist())
y_predicted = np.argmax(predictions.predictions, axis=1)
correct = sum(y_predicted == np.array(y_test.tolist()))
total = len(y_predicted)
print(
f"Correct predictions = {correct}, Total predictions = {total}, Accuracy = {correct/total}"
)
Correct predictions = 18, Total predictions = 20, Accuracy = 0.9
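If you want to inspect individual predictions, the standard Fashion MNIST label-to-class-name mapping (not defined elsewhere in this lab) lets you print a few predicted and actual classes side by side:
# Standard Fashion MNIST class names, indexed by label (0-9).
CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

for pred, actual in list(zip(y_predicted, y_test))[:5]:
    print(f"predicted: {CLASS_NAMES[pred]:12s} actual: {CLASS_NAMES[actual]}")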