This tutorial demonstrates how to train and deploy an image classification model on Vertex AI.
You will train a Convolutional Neural Network (CNN) to classify images of clothing from a dataset called Fashion MNIST.
This dataset contains 70,000 images of clothing belonging to 10 different categories. The images show individual articles of clothing at low resolution (28 by 28 pixels).
In this notebook, you will use 60,000 images to train the network and 10,000 images to evaluate how accurately the network learned to classify the images.
The Fashion MNIST data is available in TensorFlow Datasets (tfds).
In this notebook, you will create a custom-trained CNN model from a Python script in a Docker container using the Vertex SDK for Python, and then make a prediction on the deployed model by sending it data. Alternatively, you can create custom-trained models using the gcloud command-line tool, or online using the Cloud Console.
The steps performed include:
- Training a custom CNN model from a Python script in a Docker container.
- Deploying the trained Model resource to a serving Endpoint resource.
- Making predictions by sending data to the Endpoint resource.
- Undeploying the Model resource.
!pip install google-cloud-aiplatform
!pip install --user tensorflow-text
!pip install --user tensorflow-datasets
!pip install protobuf==3.20.1
Requirement already satisfied: google-cloud-aiplatform in /opt/conda/lib/python3.10/site-packages (1.43.0)
...
Collecting tensorflow-text
...
Successfully installed astunparse-1.6.3 flatbuffers-24.3.7 gast-0.5.4 google-pasta-0.2.0 h5py-3.10.0 keras-3.1.1 libclang-18.1.1 markdown-3.6 ml-dtypes-0.3.2 namex-0.0.7 opt-einsum-3.3.0 optree-0.10.0 tensorboard-2.16.2 tensorboard-data-server-0.7.2 tensorflow-2.16.1 tensorflow-io-gcs-filesystem-0.36.0 tensorflow-text-2.16.1 termcolor-2.4.0 werkzeug-3.0.1
Collecting tensorflow-datasets
...
Successfully installed absl-py-1.4.0 array-record-0.5.0 etils-1.7.0 promise-2.3 tensorflow-datasets-4.9.4 tensorflow-metadata-1.14.0
Collecting protobuf==3.20.1
...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.16.1 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3, but you have protobuf 3.20.1 which is incompatible.
google-cloud-aiplatform 1.43.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
...
Successfully installed protobuf-3.20.1
Once you’ve installed everything, you need to restart the notebook kernel so it can find the packages.
import os
if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
If you don't know your project ID, you may be able to get your project ID using the gcloud command-line tool.
import os
PROJECT_ID = "qwiklabs-gcp-01-cb944d7414a1"
if not os.getenv("IS_TESTING"):
    # Get your Google Cloud project ID from gcloud
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)
Project ID: qwiklabs-gcp-01-cb944d7414a1
If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you create a timestamp for each session and append it to the names of the resources you create in this tutorial.
from datetime import datetime
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
The following steps are required, regardless of your notebook environment.
When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. Vertex AI runs the code from this package. In this tutorial, Vertex AI also saves the trained model that results from your job in the same bucket. Using this model artifact, you can then create Vertex AI model and endpoint resources in order to serve online predictions.
Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.
You may also change the REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Vertex AI services are available. You may not use a Multi-Regional Storage bucket for training with Vertex AI. (Replace REGION with the region listed in the Qwiklabs resource panel.)
BUCKET_NAME = "gs://qwiklabs-gcp-01-cb944d7414a1-bucket"
REGION = "us-central1" # @param {type:"string"}
PROJECT_ID
'qwiklabs-gcp-01-cb944d7414a1'
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "gs://[your-bucket-name]":
    BUCKET_NAME = "gs://" + PROJECT_ID
Only if your bucket doesn’t already exist: Run the following cells to create your Cloud Storage bucket.
BUCKET_NAME
# ! gsutil mb -l $REGION $BUCKET_NAME
Go back to the lab manual and click Check my progress under the Create a Cloud Storage bucket section to verify that the bucket was created.
Next, set up some variables used throughout the tutorial.
Import the Vertex SDK for Python into your Python environment and initialize it.
import os
import sys
from google.cloud import aiplatform
from google.cloud.aiplatform import gapic as aip
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)
Here, to run the container image on a CPU, we set the variables TRAIN_GPU/TRAIN_NGPU and DEPLOY_GPU/DEPLOY_NGPU to (None, None), since this notebook is meant to be run in a Qwiklabs environment where GPUs cannot be provisioned.
Note: If you happen to be running this notebook from your personal GCP account, set the variables TRAIN_GPU/TRAIN_NGPU
and DEPLOY_GPU/DEPLOY_NGPU
to use a container image supporting a GPU and the number of GPUs allocated to the virtual machine (VM) instance. For example, to use a GPU container image with 4 Nvidia Tesla K80 GPUs allocated to each VM, you would specify:
(aip.AcceleratorType.NVIDIA_TESLA_K80, 4)
See the locations where accelerators are available.
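For illustration only, such an assignment could look like the following sketch (commented out because GPUs cannot be provisioned in this lab, and the accelerator type and counts shown are just examples):
# Illustrative only: requires GPU quota in the chosen region; not used in this lab.
# TRAIN_GPU, TRAIN_NGPU = (aip.AcceleratorType.NVIDIA_TESLA_K80, 4)
# DEPLOY_GPU, DEPLOY_NGPU = (aip.AcceleratorType.NVIDIA_TESLA_K80, 1)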
TRAIN_GPU, TRAIN_NGPU = (None, None)
DEPLOY_GPU, DEPLOY_NGPU = (None, None)
There are two ways you can train a custom model using a container image:
Use a Google Cloud prebuilt container. If you use a prebuilt container, you will additionally specify a Python package to install into the container image. This Python package contains your code for training a custom model.
Use your own custom container image. If you use your own container, the container needs to contain your code for training a custom model.
Here you will use pre-built containers provided by Vertex AI to run training and prediction.
For the latest list, see Pre-built containers for training and Pre-built containers for prediction.
TRAIN_VERSION = "tf-cpu.2-8"
DEPLOY_VERSION = "tf2-cpu.2-8"
TRAIN_IMAGE = "us-docker.pkg.dev/vertex-ai/training/{}:latest".format(TRAIN_VERSION)
DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(DEPLOY_VERSION)
print("Training:", TRAIN_IMAGE, TRAIN_GPU, TRAIN_NGPU)
print("Deployment:", DEPLOY_IMAGE, DEPLOY_GPU, DEPLOY_NGPU)
Training: us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-8:latest None None
Deployment: us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest None None
Next, set the machine types to use for training and prediction.
Set the variables TRAIN_COMPUTE and DEPLOY_COMPUTE to configure your compute resources for training and prediction.
- machine type
  - n1-standard: 3.75 GB of memory per vCPU
  - n1-highmem: 6.5 GB of memory per vCPU
  - n1-highcpu: 0.9 GB of memory per vCPU
- vCPUs: number of [2, 4, 8, 16, 32, 64, 96]
Note: The following is not supported for training:
- standard: 2 vCPUs
- highcpu: 2, 4 and 8 vCPUs
Note: You may also use n2 and e2 machine types for training and deployment, but they do not support GPUs.
MACHINE_TYPE = "n1-standard"
VCPU = "4"
TRAIN_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Train machine type", TRAIN_COMPUTE)
MACHINE_TYPE = "n1-standard"
VCPU = "4"
DEPLOY_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Deploy machine type", DEPLOY_COMPUTE)
Train machine type n1-standard-4
Deploy machine type n1-standard-4
Now you are ready to design and train your own custom CNN model with the Fashion MNIST data.
In the Introduction to Computer Vision lab, you saw how to train a Deep Neural Network (DNN) that can classify images of clothing. You can improve on the accuracy your DNN achieved by using convolutions. Convolutions narrow down the content of the image to focus on specific, distinct details. The Introduction to Convolutions lab explores convolutions in more detail; please go through it to gain an in-depth understanding of convolutions.
A Convolutional Neural Network (CNN) is a neural network in which at least one layer is a convolutional layer. A typical CNN consists of some combination of convolutional layers, pooling layers, and dense (fully connected) layers.
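As a quick, optional illustration of how these layers interact, here is a standalone sketch that assumes TensorFlow is available in your notebook kernel and simply mirrors the layer stack used in the training script below:
import tensorflow as tf

# Each 3x3 convolution trims the borders and each 2x2 max-pool halves the size:
# 28x28 -> Conv2D(3x3) -> 26x26 -> MaxPool(2x2) -> 13x13
#       -> Conv2D(3x3) -> 11x11 -> MaxPool(2x2) -> 5x5
sketch = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D(2, 2),
])
sketch.summary()  # prints the shrinking feature-map shapes listed above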
In the next cell, you will write the contents of the training script, task.py. To summarize, the training script:
- Gets the directory in which to save the model artifacts from the environment variable AIP_MODEL_DIR. This variable is set by the training service.
- Loads the Fashion MNIST dataset using tfds.
- Builds a CNN using the tf.keras model API. (Please see the code comments for details about the layers in the CNN.)
- Compiles the model (compile()).
- Trains the model (fit()) for the number of epochs specified by the argument args.epochs.
- Saves the trained model (save(MODEL_DIR)) to the specified model directory.
%%writefile task.py
# Training Fashion MNIST using CNN
import tensorflow_datasets as tfds
import tensorflow as tf
from tensorflow.python.client import device_lib
import argparse
import os
import sys
tfds.disable_progress_bar()
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', dest='epochs',
default=10, type=int,
help='Number of epochs.')
args = parser.parse_args()
print('Python Version = {}'.format(sys.version))
print('TensorFlow Version = {}'.format(tf.__version__))
print('TF_CONFIG = {}'.format(os.environ.get('TF_CONFIG', 'Not found')))
print('DEVICES', device_lib.list_local_devices())
# Define batch size
BATCH_SIZE = 32
# Load the dataset
datasets, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True)
# Normalize and batch process the dataset
ds_train = datasets['train'].map(lambda x, y: (tf.cast(x, tf.float32)/255.0, y)).batch(BATCH_SIZE)
# Build the Convolutional Neural Network
# As input, a CNN takes tensors of shape (image_height, image_width, color_channels), ignoring the batch size.
# Here, the CNN is configured to process inputs of shape (28, 28, 1), which is the format of Fashion MNIST images
# You start with a stack of Conv2D and MaxPooling2D layers.
# Conv2D layer https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D
# The number of output channels for each Conv2D layer is controlled by the first argument (e.g., 32 or 64).
# The next argument specifies the size of the Convolution, in this case, a 3x3 grid.
# `relu` is used as the activation function, which you might recall is the equivalent of returning x when x>0, else returning 0.
# MaxPooling2D layer https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool2D
# By specifying (2,2) for the MaxPooling, the effect is to quarter the size of the image.
# Dense layers
# The output from the convolution stack is flattened and passed to a set of dense layers for classification.
model = tf.keras.models.Sequential([
tf.keras.layers.Conv2D(64, (3,3), activation=tf.nn.relu, input_shape=(28, 28, 1)),
tf.keras.layers.MaxPooling2D(2, 2),
tf.keras.layers.Conv2D(64, (3,3), activation=tf.nn.relu),
tf.keras.layers.MaxPooling2D(2,2),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation=tf.nn.relu),
tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer = tf.keras.optimizers.Adam(),
loss = tf.keras.losses.SparseCategoricalCrossentropy(),
metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
# Train and save the model
MODEL_DIR = os.getenv("AIP_MODEL_DIR")
model.fit(ds_train, epochs=args.epochs)
model.save(MODEL_DIR)
Writing task.py
Prepare the command-line arguments to pass to your training script.
- args: The command line arguments to pass to the corresponding Python module. In this example, they will be:
  - "--epochs=" + EPOCHS: The number of epochs for training.
JOB_NAME = "custom_job_" + TIMESTAMP
MODEL_DIR = "{}/{}".format(BUCKET_NAME, JOB_NAME)
EPOCHS = 5
CMDARGS = [
"--epochs=" + str(EPOCHS),
]
Define your custom training job on Vertex AI.
Use the CustomTrainingJob class to define the job, which takes the following parameters:
- display_name: The user-defined name of this training pipeline.
- script_path: The local path to the training script.
- container_uri: The URI of the training container image.
- requirements: The list of Python package dependencies of the script.
- model_serving_container_image_uri: The URI of a container that can serve predictions for your model, either a prebuilt container or a custom container.
Use the run function to start training, which takes the following parameters:
- args: The command line arguments to be passed to the Python script.
- replica_count: The number of worker replicas.
- model_display_name: The display name of the Model if the script produces a managed Model.
- machine_type: The type of machine to use for training.
- accelerator_type: The hardware accelerator type.
- accelerator_count: The number of accelerators to attach to a worker replica.
The run function creates a training pipeline that trains and creates a Model object. After the training pipeline completes, the run function returns the Model object.
You can read more about the CustomTrainingJob.run API here.
job = aiplatform.CustomTrainingJob(
display_name=JOB_NAME,
script_path="task.py",
container_uri=TRAIN_IMAGE,
requirements=["tensorflow_datasets==4.6.0"],
model_serving_container_image_uri=DEPLOY_IMAGE,
)
MODEL_DISPLAY_NAME = "fashionmnist-" + TIMESTAMP
# Start the training
if TRAIN_GPU:
    model = job.run(
        model_display_name=MODEL_DISPLAY_NAME,
        args=CMDARGS,
        replica_count=1,
        machine_type=TRAIN_COMPUTE,
        accelerator_type=TRAIN_GPU.name,
        accelerator_count=TRAIN_NGPU,
    )
else:
    model = job.run(
        model_display_name=MODEL_DISPLAY_NAME,
        args=CMDARGS,
        replica_count=1,
        machine_type=TRAIN_COMPUTE,
        accelerator_count=0,
    )
Training script copied to:
gs://qwiklabs-gcp-01-cb944d7414a1-bucket/aiplatform-2024-03-22-16:40:15.235-aiplatform_custom_trainer_script-0.1.tar.gz.
Training Output directory:
gs://qwiklabs-gcp-01-cb944d7414a1-bucket/aiplatform-custom-training-2024-03-22-16:40:15.392
View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/2899049254888669184?project=45654375613
CustomTrainingJob projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184 current state:
PipelineState.PIPELINE_STATE_PENDING
CustomTrainingJob projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184 current state:
PipelineState.PIPELINE_STATE_PENDING
CustomTrainingJob projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184 current state:
PipelineState.PIPELINE_STATE_PENDING
CustomTrainingJob projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184 current state:
PipelineState.PIPELINE_STATE_PENDING
View backing custom job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/5303883494974291968?project=45654375613
CustomTrainingJob projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomTrainingJob projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomTrainingJob run completed. Resource name: projects/45654375613/locations/us-central1/trainingPipelines/2899049254888669184
Model available at projects/45654375613/locations/us-central1/models/5278008756557316096
The training pipeline will take an average of 10 to 12 minutes to finish.
To view the training pipeline status, navigate to Vertex AI ➞ Training and select the region listed in the Qwiklabs resource panel.
You can see the status of the current training pipeline there.
Once the model has been trained successfully, you can see the custom-trained model under Vertex AI ➞ Model Registry.
Go back to the lab manual and click Check my progress under the Train the Model on Vertex AI section to verify that the pipeline ran successfully.
Before you use your model to make predictions, you need to deploy it to an Endpoint. You can do this by calling the deploy function on the Model resource. This will do two things:
- Create an Endpoint resource for deploying the Model resource to.
- Deploy the Model resource to the Endpoint resource.
The function takes the following parameters:
- deployed_model_display_name: A human readable name for the deployed model.
- traffic_split: Percent of traffic at the endpoint that goes to this model, which is specified as a dictionary of one or more key/value pairs. If there are existing models deployed to the endpoint among which the traffic is split, use model_id to specify it as { "0": percent, model_id: percent, ... }, where model_id is the model ID of an existing model deployed to the endpoint. The percents must add up to 100.
- machine_type: The type of machine to use for serving.
- accelerator_type: The hardware accelerator type.
- accelerator_count: The number of accelerators to attach to a worker replica.
- starting_replica_count: The number of compute instances to initially provision.
- max_replica_count: The maximum number of compute instances to scale to. In this tutorial, only one instance is provisioned.
The traffic_split parameter is specified as a Python dictionary. You can deploy more than one instance of your model to an endpoint, and then set the percentage of traffic that goes to each instance.
You can use a traffic split to introduce a new model gradually into production. For example, if you had one existing model in production with 100% of the traffic, you could deploy a new model to the same endpoint, direct 10% of traffic to it, and reduce the original model's traffic to 90%. This allows you to monitor the new model's performance while minimizing the disruption to the majority of users.
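As a purely illustrative sketch of that canary pattern (the model ID below is a made-up placeholder, not a resource created in this lab), such a dictionary could look like this:
# Hypothetical canary split: "0" refers to the model being deployed in this call,
# the other key is the ID of a model already deployed to the endpoint.
EXISTING_MODEL_ID = "1234567890123456789"  # placeholder ID
CANARY_TRAFFIC_SPLIT = {"0": 10, EXISTING_MODEL_ID: 90}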
You can specify a single instance (or node) to serve your online prediction requests. This tutorial uses a single node, so the variables MIN_NODES and MAX_NODES are both set to 1.
If you want to use multiple nodes to serve your online prediction requests, set MAX_NODES
to the maximum number of nodes you want to use. Vertex AI autoscales the number of nodes used to serve your predictions, up to the maximum number you set. Refer to the pricing page to understand the costs of autoscaling with multiple nodes.
The method will block until the model is deployed and eventually return an Endpoint
object. If this is the first time a model is deployed to the endpoint, it may take a few additional minutes to complete provisioning of resources.
DEPLOYED_NAME = "fashionmnist_deployed-" + TIMESTAMP
TRAFFIC_SPLIT = {"0": 100}
MIN_NODES = 1
MAX_NODES = 1
if DEPLOY_GPU:
    endpoint = model.deploy(
        deployed_model_display_name=DEPLOYED_NAME,
        traffic_split=TRAFFIC_SPLIT,
        machine_type=DEPLOY_COMPUTE,
        accelerator_type=DEPLOY_GPU.name,
        accelerator_count=DEPLOY_NGPU,
        min_replica_count=MIN_NODES,
        max_replica_count=MAX_NODES,
    )
else:
    endpoint = model.deploy(
        deployed_model_display_name=DEPLOYED_NAME,
        traffic_split=TRAFFIC_SPLIT,
        machine_type=DEPLOY_COMPUTE,
        accelerator_type=None,
        accelerator_count=0,
        min_replica_count=MIN_NODES,
        max_replica_count=MAX_NODES,
    )
Creating Endpoint
Create Endpoint backing LRO: projects/45654375613/locations/us-central1/endpoints/4106150363185283072/operations/1516086931086114816
Endpoint created. Resource name: projects/45654375613/locations/us-central1/endpoints/4106150363185283072
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/45654375613/locations/us-central1/endpoints/4106150363185283072')
Deploying model to Endpoint : projects/45654375613/locations/us-central1/endpoints/4106150363185283072
Deploy Endpoint model backing LRO: projects/45654375613/locations/us-central1/endpoints/4106150363185283072/operations/543309411574087680
Endpoint model deployed. Resource name: projects/45654375613/locations/us-central1/endpoints/4106150363185283072
The deployment process will take an average of 15 to 17 minutes to finish.
To view your deployed endpoint, head over to Vertex AI ➞ Online prediction.
You can check whether your endpoint is in the list of currently deployed/deploying endpoints.
To view the details of the endpoint that is currently being deployed, simply click on the endpoint name.
Once deployment is successful, the endpoint status changes to Ready.
Go back to the lab manual and click Check my progress under the Deploy the Model section to verify that the deployment was a success.
You will now see how an online prediction request can be sent to your deployed model.
First, you will load the test split of the Fashion MNIST dataset using tfds.
Set the batch size to -1 to load the entire dataset.
import tensorflow_datasets as tfds
import numpy as np
tfds.disable_progress_bar()
datasets, info = tfds.load('fashion_mnist', batch_size=-1, with_info=True, as_supervised=True)
test_dataset = datasets['test']
2024-03-22 17:08:30.846380: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-22 17:08:34.906193: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-22 17:08:39.251268: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-22 17:08:50.656143: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading and preparing dataset 29.45 MiB (download: 29.45 MiB, generated: 36.42 MiB, total: 65.87 MiB) to /home/jupyter/tensorflow_datasets/fashion_mnist/3.0.1...
Dataset fashion_mnist downloaded and prepared to /home/jupyter/tensorflow_datasets/fashion_mnist/3.0.1. Subsequent calls will reuse this data.
Now you will convert the dataset to NumPy arrays (images, labels). The images used for training the model were normalized. Hence the trained model expects normalized images for prediction. You will normalize the test images by dividing each pixel by 255.
x_test, y_test = tfds.as_numpy(test_dataset)
# Normalize (rescale) the pixel data by dividing each pixel by 255.
x_test = x_test.astype('float32') / 255.
Ensure the shapes are correct.
x_test.shape, y_test.shape
((10000, 28, 28, 1), (10000,))
#@title Pick the number of test images
NUM_TEST_IMAGES = 20 #@param {type:"slider", min:1, max:20, step:1}
x_test, y_test = x_test[:NUM_TEST_IMAGES], y_test[:NUM_TEST_IMAGES]
Now that you have prepared the test images, you can use them to send a prediction request. Use the Endpoint object's predict function, which takes the following parameters:
- instances: A list of image instances. According to your custom model, each image instance should be a 3-dimensional matrix of floats. This was prepared in the previous step.
The predict function returns a list, where each element in the list corresponds to the corresponding image in the request. The output for each prediction will contain:
- The confidence level for the prediction (predictions), between 0 and 1, for each of the ten classes.
Now you can run a quick evaluation on the prediction results:
- np.argmax: Convert each list of confidence levels to a label.
- Compute accuracy as no. of correct predictions / total no. of predictions.
predictions = endpoint.predict(instances=x_test.tolist())
y_predicted = np.argmax(predictions.predictions, axis=1)
correct = sum(y_predicted == np.array(y_test.tolist()))
total = len(y_predicted)
print(
f"Correct predictions = {correct}, Total predictions = {total}, Accuracy = {correct/total}"
)
Correct predictions = 18, Total predictions = 20, Accuracy = 0.9
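If you want to inspect individual predictions, the standard Fashion MNIST label-to-class-name mapping (not defined elsewhere in this lab) lets you print a few predicted and actual classes side by side:
# Standard Fashion MNIST class names, indexed by label (0-9).
CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

for pred, actual in list(zip(y_predicted, y_test))[:5]:
    print(f"predicted: {CLASS_NAMES[pred]:12s} actual: {CLASS_NAMES[actual]}")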