In this lab, you will create, train, and evaluate a model, and then use it to make predictions, using Apache Beam and TensorFlow. In particular, you will train a model to predict the molecular energy based on the number of carbon, hydrogen, oxygen, and nitrogen atoms.
This lab is marked as optional because you will not be interacting with Beam-based systems directly in future exercises. Other courses of this specialization also use tools that abstract this layer. Nonetheless, it is good to be familiar with Beam since it is used under the hood by TFX, the main ML pipelines framework that you will use in other labs. Seeing how these systems work will let you explore other codebases that use this tool more freely, and even make contributions or bug fixes as you see fit. If you don’t know the basics of Beam yet, we encourage you to look at the Minimal Word Count example here for a quick start and use the Beam Programming Guide to look up concepts if needed.
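If you want a quick feel for the Beam primitives before diving in, here is a minimal, self-contained sketch (not part of the lab scripts) showing a Pipeline, an in-memory PCollection, and a couple of element-wise transforms. It mirrors the Minimal Word Count pattern, just counting atom symbols instead of words:

import apache_beam as beam

# Toy pipeline run locally with the default (Direct) runner
with beam.Pipeline() as p:
    _ = (p
         | 'Create atoms' >> beam.Create(['C', 'H', 'H', 'O', 'H', 'H'])
         | 'Pair with one' >> beam.Map(lambda atom: (atom, 1))
         | 'Count per atom' >> beam.CombinePerKey(sum)
         | 'Print results' >> beam.Map(print))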
The entire pipeline can be divided into four phases: (1) data extraction, (2) preprocessing, (3) training, and (4) predictions.
You will focus particularly on Phase 2 (Preprocessing) and a bit of Phase 4 (Predictions) because these use Beam in their implementation.
Let’s begin!
Note: This tutorial uses code, images, and discussion from this article. We highlighted a few key parts and updated some of the code to use more recent versions. Also, we focused on making the lab run locally. The original article linked above contains instructions on running it in GCP. Just take note that it will have associated costs depending on the resources you use.
You will first set up the environment and download the scripts that you will use in the lab.
# NOTE: Some of the packages used in this lab are not compatible with the
# default Python version in Colab (Python3.8 as of January 2023).
# The commands below will setup the environment to use Python3.7 instead.
# Please *DO NOT* restart the runtime after running these.
# Install packages needed to downgrade to Python3.7
!apt-get install python3.7 python3.7-distutils python3-pip
# Configure the Colab environment to use Python3.7 by default
!update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 100
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
libpython3.7-minimal libpython3.7-stdlib mailcap mime-support
python3-setuptools python3-wheel python3.7-lib2to3 python3.7-minimal
Suggested packages:
python-setuptools-doc python3.7-venv binfmt-support
The following NEW packages will be installed:
libpython3.7-minimal libpython3.7-stdlib mailcap mime-support python3-pip
python3-setuptools python3-wheel python3.7 python3.7-distutils
python3.7-lib2to3 python3.7-minimal
0 upgraded, 11 newly installed, 0 to remove and 16 not upgraded.
Need to get 6,688 kB of archives.
After this operation, 28.0 MB of additional disk space will be used.
Get:1 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 libpython3.7-minimal amd64 3.7.17-1+jammy1 [608 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 mailcap all 3.70+nmu1ubuntu1 [23.8 kB]
Get:3 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 python3.7-minimal amd64 3.7.17-1+jammy1 [1,837 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/main amd64 mime-support all 3.66 [3,696 B]
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3-setuptools all 59.6.0-1.2ubuntu0.22.04.1 [339 kB]
Get:6 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 libpython3.7-stdlib amd64 3.7.17-1+jammy1 [1,864 kB]
Get:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 python3.7 amd64 3.7.17-1+jammy1 [362 kB]
Get:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 python3.7-lib2to3 all 3.7.17-1+jammy1 [124 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-wheel all 0.37.1-2ubuntu0.22.04.1 [32.0 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-pip all 22.0.2+dfsg-1ubuntu0.3 [1,305 kB]
Get:11 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 python3.7-distutils all 3.7.17-1+jammy1 [189 kB]
Fetched 6,688 kB in 1s (7,477 kB/s)
Selecting previously unselected package libpython3.7-minimal:amd64.
(Reading database ... 120831 files and directories currently installed.)
Preparing to unpack .../00-libpython3.7-minimal_3.7.17-1+jammy1_amd64.deb ...
Unpacking libpython3.7-minimal:amd64 (3.7.17-1+jammy1) ...
Selecting previously unselected package python3.7-minimal.
Preparing to unpack .../01-python3.7-minimal_3.7.17-1+jammy1_amd64.deb ...
Unpacking python3.7-minimal (3.7.17-1+jammy1) ...
Selecting previously unselected package mailcap.
Preparing to unpack .../02-mailcap_3.70+nmu1ubuntu1_all.deb ...
Unpacking mailcap (3.70+nmu1ubuntu1) ...
Selecting previously unselected package mime-support.
Preparing to unpack .../03-mime-support_3.66_all.deb ...
Unpacking mime-support (3.66) ...
Selecting previously unselected package libpython3.7-stdlib:amd64.
Preparing to unpack .../04-libpython3.7-stdlib_3.7.17-1+jammy1_amd64.deb ...
Unpacking libpython3.7-stdlib:amd64 (3.7.17-1+jammy1) ...
Selecting previously unselected package python3-setuptools.
Preparing to unpack .../05-python3-setuptools_59.6.0-1.2ubuntu0.22.04.1_all.deb ...
Unpacking python3-setuptools (59.6.0-1.2ubuntu0.22.04.1) ...
Selecting previously unselected package python3-wheel.
Preparing to unpack .../06-python3-wheel_0.37.1-2ubuntu0.22.04.1_all.deb ...
Unpacking python3-wheel (0.37.1-2ubuntu0.22.04.1) ...
Selecting previously unselected package python3-pip.
Preparing to unpack .../07-python3-pip_22.0.2+dfsg-1ubuntu0.3_all.deb ...
Unpacking python3-pip (22.0.2+dfsg-1ubuntu0.3) ...
Selecting previously unselected package python3.7.
Preparing to unpack .../08-python3.7_3.7.17-1+jammy1_amd64.deb ...
Unpacking python3.7 (3.7.17-1+jammy1) ...
Selecting previously unselected package python3.7-lib2to3.
Preparing to unpack .../09-python3.7-lib2to3_3.7.17-1+jammy1_all.deb ...
Unpacking python3.7-lib2to3 (3.7.17-1+jammy1) ...
Selecting previously unselected package python3.7-distutils.
Preparing to unpack .../10-python3.7-distutils_3.7.17-1+jammy1_all.deb ...
Unpacking python3.7-distutils (3.7.17-1+jammy1) ...
Setting up python3-setuptools (59.6.0-1.2ubuntu0.22.04.1) ...
Setting up libpython3.7-minimal:amd64 (3.7.17-1+jammy1) ...
Setting up python3-wheel (0.37.1-2ubuntu0.22.04.1) ...
Setting up python3-pip (22.0.2+dfsg-1ubuntu0.3) ...
Setting up python3.7-minimal (3.7.17-1+jammy1) ...
Setting up python3.7-lib2to3 (3.7.17-1+jammy1) ...
Setting up mailcap (3.70+nmu1ubuntu1) ...
Setting up mime-support (3.66) ...
Setting up python3.7-distutils (3.7.17-1+jammy1) ...
Setting up libpython3.7-stdlib:amd64 (3.7.17-1+jammy1) ...
Setting up python3.7 (3.7.17-1+jammy1) ...
Processing triggers for man-db (2.10.2-1) ...
update-alternatives: using /usr/bin/python3.7 to provide /usr/bin/python3 (python3) in auto mode
# Download the scripts
!wget https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/raw/main/course4/week2-ungraded-labs/C4_W2_Lab_4_ETL_Beam/data/molecules.tar.gz
# Unzip the archive
!tar -xvzf molecules.tar.gz
--2023-08-30 04:46:22-- https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/raw/main/course4/week2-ungraded-labs/C4_W2_Lab_4_ETL_Beam/data/molecules.tar.gz
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/main/course4/week2-ungraded-labs/C4_W2_Lab_4_ETL_Beam/data/molecules.tar.gz [following]
--2023-08-30 04:46:23-- https://raw.githubusercontent.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/main/course4/week2-ungraded-labs/C4_W2_Lab_4_ETL_Beam/data/molecules.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10603 (10K) [application/octet-stream]
Saving to: ‘molecules.tar.gz’
molecules.tar.gz 100%[===================>] 10.35K --.-KB/s in 0s
2023-08-30 04:46:23 (99.0 MB/s) - ‘molecules.tar.gz’ saved [10603/10603]
data-extractor.py
predict.py
preprocess.py
pubchem/
pubchem/__init__.py
pubchem/sdf.py
pubchem/pipeline.py
trainer/
trainer/task.py
trainer/__init__.py
The molecules directory you downloaded mainly contains 4 scripts that encapsulate all phases of the workflow you will execute in this lab. It is summarized by the figure below:
It also includes the pubchem subdirectory which contains common modules (i.e. pipeline.py and sdf.py) shared by the preprocessing and prediction phases. If you look at preprocess.py and predict.py, you can see the line import pubchem at the top.
You will then install some packages needed in this notebook. You will use the Apache Beam version bundled with TensorFlow Transform. You can safely ignore the incompatibility error shown after installing dill.
# IMPORTANT NOTE: Please DO NOT restart the runtime after running the commands
# below. Doing so might cause issues because the Python runtime was switched.
# Install packages
!pip install tensorflow-transform==1.12
!pip install dill==0.3.3
Collecting tensorflow-transform==1.12
Downloading tensorflow_transform-1.12.0-py3-none-any.whl (439 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m439.8/439.8 KB[0m [31m5.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pyarrow<7,>=6
Downloading pyarrow-6.0.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.6 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m25.6/25.6 MB[0m [31m49.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting apache-beam[gcp]<3,>=2.41
Downloading apache_beam-2.48.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.5 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.5/13.5 MB[0m [31m87.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting absl-py<2.0.0,>=0.9
Downloading absl_py-1.4.0-py3-none-any.whl (126 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m126.5/126.5 KB[0m [31m16.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting numpy<2,>=1.16
Downloading numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m15.7/15.7 MB[0m [31m43.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tensorflow-metadata<1.13.0,>=1.12.0
Downloading tensorflow_metadata-1.12.0-py3-none-any.whl (52 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m52.3/52.3 KB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pydot<2,>=1.2
Downloading pydot-1.4.2-py2.py3-none-any.whl (21 kB)
Collecting tensorflow<2.12,>=2.11.0
Downloading tensorflow-2.11.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (588.3 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m588.3/588.3 MB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting protobuf<4,>=3.13
Downloading protobuf-3.20.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m69.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tfx-bsl<1.13.0,>=1.12.0
Downloading tfx_bsl-1.12.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (21.6 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m21.6/21.6 MB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting objsize<0.7.0,>=0.6.1
Downloading objsize-0.6.1-py3-none-any.whl (9.3 kB)
Collecting python-dateutil<3,>=2.8.0
Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m247.7/247.7 KB[0m [31m26.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pymongo<5.0.0,>=3.8.0
Downloading pymongo-4.5.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (653 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m653.7/653.7 KB[0m [31m54.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting crcmod<2.0,>=1.7
Downloading crcmod-1.7.tar.gz (89 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m89.7/89.7 KB[0m [31m12.2 MB/s[0m eta [36m0:00:00[0m
[?25h Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting dill<0.3.2,>=0.3.1.1
Downloading dill-0.3.1.1.tar.gz (151 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m152.0/152.0 KB[0m [31m17.8 MB/s[0m eta [36m0:00:00[0m
[?25h Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting zstandard<1,>=0.18.0
Downloading zstandard-0.21.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.7 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.7/2.7 MB[0m [31m74.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting orjson<4.0
Downloading orjson-3.9.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (139 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m139.7/139.7 KB[0m [31m16.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pytz>=2018.3
Downloading pytz-2023.3-py2.py3-none-any.whl (502 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m502.3/502.3 KB[0m [31m33.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting httplib2<0.23.0,>=0.8
Downloading httplib2-0.22.0-py3-none-any.whl (96 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m96.9/96.9 KB[0m [31m12.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting proto-plus<2,>=1.7.1
Downloading proto_plus-1.22.3-py3-none-any.whl (48 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.1/48.1 KB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting hdfs<3.0.0,>=2.1.0
Downloading hdfs-2.7.2.tar.gz (43 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.4/43.4 KB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[?25h Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting regex>=2020.6.8
Downloading regex-2023.8.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (758 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m759.0/759.0 KB[0m [31m62.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting cloudpickle~=2.2.1
Downloading cloudpickle-2.2.1-py3-none-any.whl (25 kB)
Collecting fastavro<2,>=0.23.6
Downloading fastavro-1.8.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m92.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting typing-extensions>=3.7.0
Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting fasteners<1.0,>=0.3
Downloading fasteners-0.18-py3-none-any.whl (18 kB)
Collecting requests<3.0.0,>=2.24.0
Downloading requests-2.31.0-py3-none-any.whl (62 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.6/62.6 KB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting grpcio!=1.48.0,<2,>=1.33.1
Downloading grpcio-1.57.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.3 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.3/5.3 MB[0m [31m102.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-recommendations-ai<0.11.0,>=0.1.0
Downloading google_cloud_recommendations_ai-0.10.4-py2.py3-none-any.whl (173 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m173.3/173.3 KB[0m [31m22.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-bigquery<4,>=2.0.0
Downloading google_cloud_bigquery-3.11.4-py2.py3-none-any.whl (219 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m219.6/219.6 KB[0m [31m27.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-vision<4,>=2
Downloading google_cloud_vision-3.4.4-py2.py3-none-any.whl (444 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m444.0/444.0 KB[0m [31m43.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-language<3,>=2.0
Downloading google_cloud_language-2.11.0-py2.py3-none-any.whl (138 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m138.7/138.7 KB[0m [31m18.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-dlp<4,>=3.0.0
Downloading google_cloud_dlp-3.12.2-py2.py3-none-any.whl (143 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.4/143.4 KB[0m [31m16.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-bigtable<2.18.0,>=2.0.0
Downloading google_cloud_bigtable-2.17.0-py2.py3-none-any.whl (288 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m288.6/288.6 KB[0m [31m31.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-core<3,>=2.0.0
Downloading google_cloud_core-2.3.3-py2.py3-none-any.whl (29 kB)
Collecting google-cloud-pubsublite<2,>=1.2.0
Downloading google_cloud_pubsublite-1.7.0-py2.py3-none-any.whl (273 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m273.9/273.9 KB[0m [31m31.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting cachetools<6,>=3.1.0
Downloading cachetools-5.3.1-py3-none-any.whl (9.3 kB)
Collecting google-auth-httplib2<0.2.0,>=0.1.0
Downloading google_auth_httplib2-0.1.0-py2.py3-none-any.whl (9.3 kB)
Collecting google-cloud-bigquery-storage<3,>=2.6.3
Downloading google_cloud_bigquery_storage-2.22.0-py2.py3-none-any.whl (190 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.9/190.9 KB[0m [31m21.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-pubsub<3,>=2.1.0
Downloading google_cloud_pubsub-2.18.3-py2.py3-none-any.whl (265 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m265.9/265.9 KB[0m [31m30.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-apitools<0.5.32,>=0.5.31
Downloading google-apitools-0.5.31.tar.gz (173 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m173.5/173.5 KB[0m [31m24.5 MB/s[0m eta [36m0:00:00[0m
[?25h Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting google-cloud-datastore<3,>=2.0.0
Downloading google_cloud_datastore-2.17.0-py2.py3-none-any.whl (177 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m177.1/177.1 KB[0m [31m20.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-spanner<4,>=3.0.0
Downloading google_cloud_spanner-3.40.1-py2.py3-none-any.whl (332 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m332.9/332.9 KB[0m [31m38.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-auth<3,>=1.18.0
Downloading google_auth-2.22.0-py2.py3-none-any.whl (181 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m181.8/181.8 KB[0m [31m24.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-videointelligence<3,>=2.0
Downloading google_cloud_videointelligence-2.11.3-py2.py3-none-any.whl (229 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m229.4/229.4 KB[0m [31m29.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pyparsing>=2.1.4
Downloading pyparsing-3.1.1-py3-none-any.whl (103 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m103.1/103.1 KB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting packaging
Downloading packaging-23.1-py3-none-any.whl (48 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.9/48.9 KB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[?25hRequirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from tensorflow<2.12,>=2.11.0->tensorflow-transform==1.12) (59.6.0)
Collecting opt-einsum>=2.3.2
Downloading opt_einsum-3.3.0-py3-none-any.whl (65 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m65.5/65.5 KB[0m [31m669.8 kB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-pasta>=0.1.1
Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m57.5/57.5 KB[0m [31m7.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting termcolor>=1.1.0
Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Collecting tensorboard<2.12,>=2.11
Downloading tensorboard-2.11.2-py3-none-any.whl (6.0 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.0/6.0 MB[0m [31m106.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting astunparse>=1.6.0
Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting tensorflow-io-gcs-filesystem>=0.23.1
Downloading tensorflow_io_gcs_filesystem-0.33.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.4 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m74.1 MB/s[0m eta [36m0:00:00[0m
[?25hINFO: pip is looking at multiple versions of pydot to determine which version is compatible with other requirements. This could take a while.
Collecting pydot<2,>=1.2
Downloading pydot-1.4.1-py2.py3-none-any.whl (19 kB)
INFO: pip is looking at multiple versions of pyarrow to determine which version is compatible with other requirements. This could take a while.
Collecting pyarrow<7,>=6
Downloading pyarrow-6.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.5 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m25.5/25.5 MB[0m [31m16.3 MB/s[0m eta [36m0:00:00[0m
[?25hINFO: pip is looking at multiple versions of protobuf to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of numpy to determine which version is compatible with other requirements. This could take a while.
Collecting numpy<2,>=1.16
Downloading numpy-1.21.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m15.7/15.7 MB[0m [31m83.5 MB/s[0m eta [36m0:00:00[0m
[?25hINFO: pip is looking at multiple versions of apache-beam[gcp] to determine which version is compatible with other requirements. This could take a while.
Collecting apache-beam[gcp]<3,>=2.41
Downloading apache_beam-2.47.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.5 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.5/13.5 MB[0m [31m14.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting httplib2<0.22.0,>=0.8
Downloading httplib2-0.21.0-py3-none-any.whl (96 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m96.8/96.8 KB[0m [31m13.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting cachetools<5,>=3.1.0
Downloading cachetools-4.2.4-py3-none-any.whl (10 kB)
Collecting google-cloud-bigtable<3,>=2.0.0
Downloading google_cloud_bigtable-2.21.0-py2.py3-none-any.whl (293 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m293.0/293.0 KB[0m [31m27.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting apache-beam[gcp]<3,>=2.41
Downloading apache_beam-2.46.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.4/13.4 MB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pymongo<4.0.0,>=3.8.0
Downloading pymongo-3.13.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (506 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m506.0/506.0 KB[0m [31m44.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-videointelligence<2,>=1.8.0
Downloading google_cloud_videointelligence-1.16.3-py2.py3-none-any.whl (183 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m183.9/183.9 KB[0m [31m24.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-language<2,>=1.3.0
Downloading google_cloud_language-1.3.2-py2.py3-none-any.whl (83 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m83.6/83.6 KB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-recommendations-ai<0.8.0,>=0.1.0
Downloading google_cloud_recommendations_ai-0.7.1-py2.py3-none-any.whl (148 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m148.2/148.2 KB[0m [31m17.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-bigtable<2,>=0.31.1
Downloading google_cloud_bigtable-1.7.3-py2.py3-none-any.whl (268 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.7/268.7 KB[0m [31m33.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-bigquery-storage<2.17,>=2.6.3
Downloading google_cloud_bigquery_storage-2.16.2-py2.py3-none-any.whl (185 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m185.4/185.4 KB[0m [31m25.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-datastore<2,>=1.8.0
Downloading google_cloud_datastore-1.15.5-py2.py3-none-any.whl (134 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.2/134.2 KB[0m [31m20.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting protobuf<4,>=3.13
Downloading protobuf-3.19.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m77.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting flatbuffers>=2.0
Downloading flatbuffers-23.5.26-py2.py3-none-any.whl (26 kB)
Collecting keras<2.12,>=2.11.0
Downloading keras-2.11.0-py2.py3-none-any.whl (1.7 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m93.2 MB/s[0m eta [36m0:00:00[0m
[?25hRequirement already satisfied: six>=1.12.0 in /usr/lib/python3/dist-packages (from tensorflow<2.12,>=2.11.0->tensorflow-transform==1.12) (1.16.0)
Collecting gast<=0.4.0,>=0.2.1
Downloading gast-0.4.0-py3-none-any.whl (9.8 kB)
Collecting h5py>=2.9.0
Downloading h5py-3.8.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.3 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.3/4.3 MB[0m [31m108.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tensorflow-estimator<2.12,>=2.11.0
Downloading tensorflow_estimator-2.11.0-py2.py3-none-any.whl (439 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m439.2/439.2 KB[0m [31m45.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting libclang>=13.0.0
Downloading libclang-16.0.6-py2.py3-none-manylinux2010_x86_64.whl (22.9 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m22.9/22.9 MB[0m [31m74.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting wrapt>=1.11.0
Downloading wrapt-1.15.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (75 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.7/75.7 KB[0m [31m9.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting googleapis-common-protos<2,>=1.52.0
Downloading googleapis_common_protos-1.60.0-py2.py3-none-any.whl (227 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m227.6/227.6 KB[0m [31m24.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pandas<2,>=1.0
Downloading pandas-1.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.3/11.3 MB[0m [31m109.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tensorflow-serving-api!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,<3,>=1.15
Downloading tensorflow_serving_api-2.11.1-py2.py3-none-any.whl (37 kB)
Collecting google-api-python-client<2,>=1.7.11
Downloading google_api_python_client-1.12.11-py2.py3-none-any.whl (62 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.1/62.1 KB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
[?25hRequirement already satisfied: wheel<1.0,>=0.23.0 in /usr/lib/python3/dist-packages (from astunparse>=1.6.0->tensorflow<2.12,>=2.11.0->tensorflow-transform==1.12) (0.37.1)
Collecting google-api-core<3dev,>=1.21.0
Downloading google_api_core-2.11.1-py3-none-any.whl (120 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m120.5/120.5 KB[0m [31m16.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting uritemplate<4dev,>=3.0.0
Downloading uritemplate-3.0.1-py2.py3-none-any.whl (15 kB)
Collecting oauth2client>=1.4.12
Downloading oauth2client-4.1.3-py2.py3-none-any.whl (98 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.2/98.2 KB[0m [31m13.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pyasn1-modules>=0.2.1
Downloading pyasn1_modules-0.3.0-py2.py3-none-any.whl (181 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m181.3/181.3 KB[0m [31m24.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting urllib3<2.0
Downloading urllib3-1.26.16-py2.py3-none-any.whl (143 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.1/143.1 KB[0m [31m19.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting rsa<5,>=3.1.4
Downloading rsa-4.9-py3-none-any.whl (34 kB)
Collecting google-resumable-media<3.0dev,>=0.6.0
Downloading google_resumable_media-2.5.0-py2.py3-none-any.whl (77 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.7/77.7 KB[0m [31m10.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting grpc-google-iam-v1<0.13dev,>=0.12.3
Downloading grpc_google_iam_v1-0.12.6-py2.py3-none-any.whl (26 kB)
Collecting grpcio-status>=1.33.2
Downloading grpcio_status-1.57.0-py3-none-any.whl (5.1 kB)
Collecting overrides<7.0.0,>=6.0.1
Downloading overrides-6.5.0-py3-none-any.whl (17 kB)
Collecting sqlparse>=0.4.4
Downloading sqlparse-0.4.4-py3-none-any.whl (41 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m41.2/41.2 KB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting docopt
Downloading docopt-0.6.2.tar.gz (25 kB)
Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting charset-normalizer<4,>=2
Downloading charset_normalizer-3.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (175 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m175.8/175.8 KB[0m [31m23.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting certifi>=2017.4.17
Downloading certifi-2023.7.22-py3-none-any.whl (158 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m158.3/158.3 KB[0m [31m20.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting idna<4,>=2.5
Downloading idna-3.4-py3-none-any.whl (61 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.5/61.5 KB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tensorboard-data-server<0.7.0,>=0.6.0
Downloading tensorboard_data_server-0.6.1-py3-none-manylinux2010_x86_64.whl (4.9 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.9/4.9 MB[0m [31m108.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting markdown>=2.6.8
Downloading Markdown-3.4.4-py3-none-any.whl (94 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m94.2/94.2 KB[0m [31m12.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting werkzeug>=1.0.1
Downloading Werkzeug-2.2.3-py3-none-any.whl (233 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m233.6/233.6 KB[0m [31m27.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-auth-oauthlib<0.5,>=0.4.1
Downloading google_auth_oauthlib-0.4.6-py2.py3-none-any.whl (18 kB)
Collecting tensorboard-plugin-wit>=1.6.0
Downloading tensorboard_plugin_wit-1.8.1-py3-none-any.whl (781 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m781.3/781.3 KB[0m [31m61.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tensorflow-serving-api!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,<3,>=1.15
Downloading tensorflow_serving_api-2.11.0-py2.py3-none-any.whl (37 kB)
Collecting requests-oauthlib>=0.7.0
Downloading requests_oauthlib-1.3.1-py2.py3-none-any.whl (23 kB)
Collecting google-crc32c<2.0dev,>=1.0
Downloading google_crc32c-1.5.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (32 kB)
Collecting grpcio-status>=1.33.2
Downloading grpcio_status-1.56.2-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.56.0-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.55.3-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.54.3-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.54.2-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.54.0-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.53.2-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.53.1-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.53.0-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.51.3-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.51.1-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.50.0-py3-none-any.whl (14 kB)
Downloading grpcio_status-1.49.1-py3-none-any.whl (14 kB)
Downloading grpcio_status-1.48.2-py3-none-any.whl (14 kB)
Collecting importlib-metadata>=4.4
Downloading importlib_metadata-6.7.0-py3-none-any.whl (22 kB)
Collecting pyasn1>=0.1.7
Downloading pyasn1-0.5.0-py2.py3-none-any.whl (83 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m83.9/83.9 KB[0m [31m12.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting MarkupSafe>=2.1.1
Downloading MarkupSafe-2.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Collecting zipp>=0.5
Downloading zipp-3.15.0-py3-none-any.whl (6.8 kB)
Collecting oauthlib>=3.0.0
Downloading oauthlib-3.2.2-py3-none-any.whl (151 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m151.7/151.7 KB[0m [31m21.1 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: crcmod, dill, google-apitools, hdfs, docopt
Building wheel for crcmod (setup.py) ... [?25l[?25hdone
Created wheel for crcmod: filename=crcmod-1.7-py3-none-any.whl size=18850 sha256=8378f38ffd63e79b51bb7ecd940dad615dc9a2b4cc121bf0f5eb1eaba61dd8db
Stored in directory: /root/.cache/pip/wheels/dc/9a/e9/49e627353476cec8484343c4ab656f1e0d783ee77b9dde2d1f
Building wheel for dill (setup.py) ... [?25l[?25hdone
Created wheel for dill: filename=dill-0.3.1.1-py3-none-any.whl size=78544 sha256=66e33f528bc520a52c1bfbc624a413a62edc87eb8fcdcea68908a5b1fbd85a2d
Stored in directory: /root/.cache/pip/wheels/a4/61/fd/c57e374e580aa78a45ed78d5859b3a44436af17e22ca53284f
Building wheel for google-apitools (setup.py) ... [?25l[?25hdone
Created wheel for google-apitools: filename=google_apitools-0.5.31-py3-none-any.whl size=131040 sha256=3b45a2ad6ddaf5dd0b99ca1887f1b6848069524b49144d71c364d7d7caedeaab
Stored in directory: /root/.cache/pip/wheels/19/b5/2f/1cc3cf2b31e7a9cd1508731212526d9550271274d351c96f16
Building wheel for hdfs (setup.py) ... [?25l[?25hdone
Created wheel for hdfs: filename=hdfs-2.7.2-py3-none-any.whl size=34191 sha256=984d921e31ce51f65d86cd82886f87e772ca6d49b84598a02e5d0a2024e7cbd1
Stored in directory: /root/.cache/pip/wheels/e2/33/27/5b3de59cfbdc8df69a691254cf86847b733db9db78b07c6704
Building wheel for docopt (setup.py) ... [?25l[?25hdone
Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13723 sha256=f49424aa677829372a04027263cd07f7786bfc8d546402d1909ab3d533234d7c
Stored in directory: /root/.cache/pip/wheels/72/b0/3f/1d95f96ff986c7dfffe46ce2be4062f38ebd04b506c77c81b9
Successfully built crcmod dill google-apitools hdfs docopt
Installing collected packages: tensorboard-plugin-wit, pytz, libclang, flatbuffers, docopt, crcmod, zstandard, zipp, wrapt, urllib3, uritemplate, typing-extensions, termcolor, tensorflow-io-gcs-filesystem, tensorflow-estimator, tensorboard-data-server, sqlparse, regex, python-dateutil, pyparsing, pymongo, pyasn1, protobuf, packaging, overrides, orjson, objsize, oauthlib, numpy, MarkupSafe, keras, idna, grpcio, google-pasta, google-crc32c, gast, fasteners, fastavro, dill, cloudpickle, charset-normalizer, certifi, cachetools, astunparse, absl-py, werkzeug, rsa, requests, pydot, pyasn1-modules, pyarrow, proto-plus, pandas, opt-einsum, importlib-metadata, httplib2, h5py, googleapis-common-protos, google-resumable-media, tensorflow-metadata, requests-oauthlib, oauth2client, markdown, hdfs, grpcio-status, google-auth, grpc-google-iam-v1, google-auth-oauthlib, google-auth-httplib2, google-apitools, google-api-core, apache-beam, tensorboard, google-cloud-core, google-api-python-client, tensorflow, google-cloud-vision, google-cloud-videointelligence, google-cloud-spanner, google-cloud-recommendations-ai, google-cloud-pubsub, google-cloud-language, google-cloud-dlp, google-cloud-datastore, google-cloud-bigtable, google-cloud-bigquery-storage, google-cloud-bigquery, tensorflow-serving-api, google-cloud-pubsublite, tfx-bsl, tensorflow-transform
Successfully installed MarkupSafe-2.1.3 absl-py-1.4.0 apache-beam-2.46.0 astunparse-1.6.3 cachetools-4.2.4 certifi-2023.7.22 charset-normalizer-3.2.0 cloudpickle-2.2.1 crcmod-1.7 dill-0.3.1.1 docopt-0.6.2 fastavro-1.8.0 fasteners-0.18 flatbuffers-23.5.26 gast-0.4.0 google-api-core-2.11.1 google-api-python-client-1.12.11 google-apitools-0.5.31 google-auth-2.22.0 google-auth-httplib2-0.1.0 google-auth-oauthlib-0.4.6 google-cloud-bigquery-3.11.4 google-cloud-bigquery-storage-2.16.2 google-cloud-bigtable-1.7.3 google-cloud-core-2.3.3 google-cloud-datastore-1.15.5 google-cloud-dlp-3.12.2 google-cloud-language-1.3.2 google-cloud-pubsub-2.18.3 google-cloud-pubsublite-1.7.0 google-cloud-recommendations-ai-0.7.1 google-cloud-spanner-3.40.1 google-cloud-videointelligence-1.16.3 google-cloud-vision-3.4.4 google-crc32c-1.5.0 google-pasta-0.2.0 google-resumable-media-2.5.0 googleapis-common-protos-1.60.0 grpc-google-iam-v1-0.12.6 grpcio-1.57.0 grpcio-status-1.48.2 h5py-3.8.0 hdfs-2.7.2 httplib2-0.21.0 idna-3.4 importlib-metadata-6.7.0 keras-2.11.0 libclang-16.0.6 markdown-3.4.4 numpy-1.21.6 oauth2client-4.1.3 oauthlib-3.2.2 objsize-0.6.1 opt-einsum-3.3.0 orjson-3.9.5 overrides-6.5.0 packaging-23.1 pandas-1.3.5 proto-plus-1.22.3 protobuf-3.19.6 pyarrow-6.0.1 pyasn1-0.5.0 pyasn1-modules-0.3.0 pydot-1.4.2 pymongo-3.13.0 pyparsing-3.1.1 python-dateutil-2.8.2 pytz-2023.3 regex-2023.8.8 requests-2.31.0 requests-oauthlib-1.3.1 rsa-4.9 sqlparse-0.4.4 tensorboard-2.11.2 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorflow-2.11.0 tensorflow-estimator-2.11.0 tensorflow-io-gcs-filesystem-0.33.0 tensorflow-metadata-1.12.0 tensorflow-serving-api-2.11.0 tensorflow-transform-1.12.0 termcolor-2.3.0 tfx-bsl-1.12.0 typing-extensions-4.7.1 uritemplate-3.0.1 urllib3-1.26.16 werkzeug-2.2.3 wrapt-1.15.0 zipp-3.15.0 zstandard-0.21.0
[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv[0m[33m
[0m
Collecting dill==0.3.3
Downloading dill-0.3.3-py2.py3-none-any.whl (81 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.3/81.3 KB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: dill
Attempting uninstall: dill
Found existing installation: dill 0.3.1.1
Uninstalling dill-0.3.1.1:
Successfully uninstalled dill-0.3.1.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.3 which is incompatible.[0m[31m
[0mSuccessfully installed dill-0.3.3
[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv[0m[33m
[0m
Note: Please DO NOT restart the runtime after installing the packages. Doing so might cause issues because the Python version was switched. If you restarted it by mistake, kindly go to Runtime > Disconnect and delete runtime, and then re-run the previous cells.
Next, you’ll define the working directory to contain all the results you will generate in this exercise. After each phase, you can open the Colab file explorer on the left and look under the results directory to see the new files and directories generated.
# Define working directory
WORK_DIR = "results"
With that, you are now ready to execute the pipeline. As shown in the figure earlier, the logic is already implemented in the four main scripts. You will run them one by one, and the next sections will discuss the relevant details and the outputs generated.
The first step is to extract the input data. The dataset is stored as SDF files and is extracted from the National Center for Biotechnology Information (FTP source). Chapter 6 of this document shows a more detailed description of the SDF file format.
The data-extractor.py file extracts and decompresses the specified SDF files. In later steps, the example preprocesses these files and uses the data to train and evaluate the machine learning model. The file extracts the SDF files from the public source and stores them in a subdirectory inside the specified working directory.
As you can see here, the complete set of files is huge and can easily exceed storage limits in Colab. For this exercise, you will just download one file. You can use the script as shown in the cells below:
# Print the help documentation. You can ignore references to GCP because you will be running everything in Colab.
!python ./molecules/data-extractor.py --help
python3: can't open file './molecules/data-extractor.py': [Errno 2] No such file or directory
# Run the data extractor
!python ./molecules/data-extractor.py --max-data-files 1 --work-dir={WORK_DIR}
python3: can't open file './molecules/data-extractor.py': [Errno 2] No such file or directory
You should now have a new folder in your work directory called data. This will contain the SDF file you downloaded.
# List working directory
!ls {WORK_DIR}
ls: cannot access 'results': No such file or directory
The SDF documentation linked earlier shows that one record is terminated by $$$$. You can use the command below to print the first one in the file. As you’ll see, just one record is already pretty long. In the next phase, you’ll feed these records into a pipeline that will transform them into a form that can be consumed by our model.
# Print one record
!sed '/$$$$/q' {WORK_DIR}/data/00000001_00025000.sdf
sed: can't read results/data/00000001_00025000.sdf: No such file or directory
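If you prefer to inspect the file in Python instead of with sed, here is a small, hypothetical helper (not one of the lab scripts) that splits the file into records on the $$$$ delimiter; the path assumes the file downloaded by data-extractor.py above:

# Hypothetical helper: iterate over SDF records delimited by '$$$$'
sdf_path = 'results/data/00000001_00025000.sdf'  # file created by data-extractor.py

def iter_sdf_records(path):
    record_lines = []
    with open(path) as f:
        for line in f:
            record_lines.append(line)
            if line.strip() == '$$$$':
                yield ''.join(record_lines)
                record_lines = []

# Preview the start of the first record
print(next(iter_sdf_records(sdf_path))[:500])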
The next script, preprocess.py, uses an Apache Beam pipeline to preprocess the data. The pipeline performs the following preprocessing actions: it reads and parses the molecules from the extracted SDF files, counts the atoms in each molecule, normalizes the counts with tf.Transform, splits the result into training and evaluation sets, and writes them out as TFRecords together with the tf.Transform outputs.
Apache Beam transforms can efficiently manipulate single elements at a time, but transforms that require a full pass of the dataset cannot easily be done with only Apache Beam and are better done using tf.Transform. Because of this, the code uses Apache Beam transforms to read and format the molecules, and to count the atoms in each molecule. The code then uses tf.Transform to find the global minimum and maximum counts in order to normalize the data.
The following image shows the steps in the pipeline.
You will run the script first and the following sections will discuss the relevant parts of this code. This will take around 6 minutes to run.
# Print help documentation
!python ./molecules/preprocess.py --help
python3: can't open file './molecules/preprocess.py': [Errno 2] No such file or directory
# Run the preprocessing script
!python ./molecules/preprocess.py --work-dir={WORK_DIR}
python3: can't open file './molecules/preprocess.py': [Errno 2] No such file or directory
You should now have a few more outputs in your work directory. The most important are the training and evaluation datasets, the PreprocessData file, and the tf.Transform outputs.
# List working directory
!ls {WORK_DIR}
ls: cannot access 'results': No such file or directory
The training and evaluation datasets contain TFRecords and you can view them by running the helper function in the cells below.
from google.protobuf.json_format import MessageToDict
# Define a helper function to get individual examples
def get_records(dataset, num_records):
'''Extracts records from the given dataset.
Args:
dataset (TFRecordDataset): dataset saved in the preprocessing step
num_records (int): number of records to preview
'''
# initialize an empty list
records = []
# Use the `take()` method to specify how many records to get
for tfrecord in dataset.take(num_records):
# Get the numpy property of the tensor
serialized_example = tfrecord.numpy()
# Initialize a `tf.train.Example()` to read the serialized data
example = tf.train.Example()
# Read the example data (output is a protocol buffer message)
example.ParseFromString(serialized_example)
# Convert the protocol buffer message to a Python dictionary
example_dict = (MessageToDict(example))
# append to the records list
records.append(example_dict)
return records
import tensorflow as tf
from pprint import pprint
# Create TF Dataset from TFRecord of training set
train_data = tf.data.TFRecordDataset(f'{WORK_DIR}/train-dataset/part-00000-of-00001')
# Print two records
test_data = get_records(train_data, 2)
pprint(test_data)
---------------------------------------------------------------------------
NotFoundError Traceback (most recent call last)
<ipython-input-13-dda645320f70> in <cell line: 8>()
6
7 # Print two records
----> 8 test_data = get_records(train_data, 2)
9
10 pprint(test_data)
<ipython-input-12-68f3a263de80> in get_records(dataset, num_records)
13
14 # Use the `take()` method to specify how many records to get
---> 15 for tfrecord in dataset.take(num_records):
16
17 # Get the numpy property of the tensor
/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/iterator_ops.py in __next__(self)
795 def __next__(self):
796 try:
--> 797 return self._next_internal()
798 except errors.OutOfRangeError:
799 raise StopIteration
/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/iterator_ops.py in _next_internal(self)
778 # to communicate that there is no more data to iterate over.
779 with context.execution_mode(context.SYNC):
--> 780 ret = gen_dataset_ops.iterator_get_next(
781 self._iterator_resource,
782 output_types=self._flat_output_types,
/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/gen_dataset_ops.py in iterator_get_next(iterator, output_types, output_shapes, name)
3014 return _result
3015 except _core._NotOkStatusException as e:
-> 3016 _ops.raise_from_not_ok_status(e, name)
3017 except _core._FallbackException:
3018 pass
/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
7260 def raise_from_not_ok_status(e, name):
7261 e.message += (" name: " + name if name is not None else "")
-> 7262 raise core._status_to_exception(e) from None # pylint: disable=protected-access
7263
7264
NotFoundError: results/train-dataset/part-00000-of-00001; No such file or directory [Op:IteratorGetNext]
Note: From the output cell above, you might agree that we’ll need more than the atom counts to make better predictions. You’ll notice that the counts are identical in both records but the Energy value is different. Thus, you cannot expect the model to have a low loss during the training phase later. For simplicity, we’ll just use atom counts in this exercise, but feel free to revise the code later to include more predictive features. You can share your findings in our Discourse community to discuss with other learners who are interested in the same problem.
The PreprocessData file contains Python objects needed in the training phase, such as the input feature spec and the locations of the training and evaluation datasets. These are saved in a serialized file using dill when you ran the preprocess script earlier, and you can deserialize it using the cell below to view its contents.
# Import path to the Python3.7 packages so it can find the dill module
import sys
sys.path.insert(0, r"/usr/local/lib/python3.7/dist-packages")
import dill as pickle
# Helper function to load the serialized file
def load(filename):
with tf.io.gfile.GFile(filename, 'rb') as f:
return pickle.load(f)
# Load PreprocessData
preprocess_data = load('/content/results/PreprocessData')
# Print contents
pprint(vars(preprocess_data))
The next sections will describe how these are implemented as a Beam pipeline in preprocess.py. You can open this file in a separate text editor so you can look at it more closely while reviewing the snippets below.
The preprocess.py code creates an Apache Beam pipeline.
# Build and run a Beam Pipeline
with beam.Pipeline(options=beam_options) as p, \
beam_impl.Context(temp_dir=tft_temp_dir):
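For context, beam_options and tft_temp_dir come from the script's command-line arguments. A minimal, hypothetical local setup (not the script's actual code) might look like this:

import tempfile
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import tensorflow_transform.beam.impl as beam_impl

# Hypothetical local setup; preprocess.py builds these from its CLI arguments
beam_options = PipelineOptions(runner='DirectRunner')
tft_temp_dir = tempfile.mkdtemp()

with beam.Pipeline(options=beam_options) as p, \
        beam_impl.Context(temp_dir=tft_temp_dir):
    pass  # the feature extraction, tf.Transform, and write steps go here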
Next, the code applies a feature_extraction transform to the pipeline.
# Transform and validate the input data matches the input schema
dataset = (
p
| 'Feature extraction' >> feature_extraction
The pipeline uses SimpleFeatureExtraction as its feature_extraction transform.
pubchem.SimpleFeatureExtraction(pubchem.ParseSDF(data_files_pattern)),
The SimpleFeatureExtraction transform, defined in pubchem/pipeline.py, contains a series of transforms that manipulate all elements independently. First, the code parses the molecules from the source file, then formats the molecules to a dictionary of molecule properties, and finally, counts the atoms in the molecule. These counts are the features (inputs) for the machine learning model.
class SimpleFeatureExtraction(beam.PTransform):
"""The feature extraction (element-wise transformations).
We create a `PTransform` class. This `PTransform` is a bundle of
transformations that can be applied to any other pipeline as a step.
We'll extract all the raw features here. Due to the nature of `PTransform`s,
we can only do element-wise transformations here. Anything that requires a
full-pass of the data (such as feature scaling) has to be done with
tf.Transform.
"""
def __init__(self, source):
super(SimpleFeatureExtraction, self).__init__()
self.source = source
def expand(self, p):
# Return the preprocessing pipeline. In this case we're reading the PubChem
# files, but the source could be any Apache Beam source.
return (p
| 'Read raw molecules' >> self.source
| 'Format molecule' >> beam.ParDo(FormatMolecule())
| 'Count atoms' >> beam.ParDo(CountAtoms())
)
The read transform beam.io.Read(pubchem.ParseSDF(data_files_pattern)) reads SDF files from a custom source.
The custom source, called ParseSDF, is defined in pubchem/pipeline.py. ParseSDF extends FileBasedSource and implements the read_records function that opens the extracted SDF files.
The pipeline groups the raw data into sections of relevant information needed for the next steps. Each section in the parsed SDF file is stored in a dictionary (see pubchem/sdf.py), where the keys are the section names and the values are the raw line contents of the corresponding section.
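To illustrate the pattern only (this is not the actual ParseSDF code), a bare-bones custom source that extends FileBasedSource and implements read_records could look like this:

from apache_beam.io import filebasedsource

class LineSource(filebasedsource.FileBasedSource):
    """Illustrative custom source: yields raw lines from the matched files.
    ParseSDF follows the same pattern but parses SDF sections instead."""

    def __init__(self, file_pattern):
        # Keep the sketch simple by marking the source as unsplittable
        super().__init__(file_pattern, splittable=False)

    def read_records(self, file_name, offset_range_tracker):
        with self.open_file(file_name) as f:
            for line in f:
                yield line

# Usage sketch: p | 'Read raw lines' >> beam.io.Read(LineSource('results/data/*.sdf'))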
The code applies beam.ParDo(FormatMolecule()) to the pipeline. The ParDo applies the DoFn named FormatMolecule to each molecule. FormatMolecule yields a dictionary of formatted molecules. The following snippet is an example of an element in the output PCollection:
{
'atoms': [
{
'atom_atom_mapping_number': 0,
'atom_stereo_parity': 0,
'atom_symbol': u'O',
'charge': 0,
'exact_change_flag': 0,
'h0_designator': 0,
'hydrogen_count': 0,
'inversion_retention': 0,
'mass_difference': 0,
'stereo_care_box': 0,
'valence': 0,
'x': -0.0782,
'y': -1.5651,
'z': 1.3894,
},
...
],
'bonds': [
{
'bond_stereo': 0,
'bond_topology': 0,
'bond_type': 1,
'first_atom_number': 1,
'reacting_center_status': 0,
'second_atom_number': 5,
},
...
],
'<PUBCHEM_COMPOUND_CID>': ['3\n'],
...
'<PUBCHEM_MMFF94_ENERGY>': ['19.4085\n'],
...
}
Then, the code applies beam.ParDo(CountAtoms()) to the pipeline. The DoFn CountAtoms sums the number of carbon, hydrogen, nitrogen, and oxygen atoms each molecule has. CountAtoms outputs a PCollection of features and labels. Here is an example of an element in the output PCollection:
{
'ID': 3,
'TotalC': 7,
'TotalH': 8,
'TotalO': 4,
'TotalN': 0,
'Energy': 19.4085,
}
The pipeline then validates the inputs. The ValidateInputData DoFn validates that every element matches the metadata given in the input_schema. This validation ensures that the data is in the correct format when it’s fed into TensorFlow.
| 'Validate inputs' >> beam.ParDo(ValidateInputData(
input_feature_spec)))
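As a rough idea of what such a check involves (this is not the lab's exact implementation), a validation DoFn could simply compare each element's keys against the feature spec:

import apache_beam as beam

class ValidateInputDataSketch(beam.DoFn):
    """Illustrative DoFn: fails fast if an element does not have the feature spec keys."""

    def __init__(self, input_feature_spec):
        super().__init__()
        self.expected_keys = set(input_feature_spec.keys())

    def process(self, elem):
        if set(elem.keys()) != self.expected_keys:
            raise ValueError(
                'Element keys {} do not match the feature spec keys {}'.format(
                    sorted(elem.keys()), sorted(self.expected_keys)))
        yield elem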
The Molecules code sample uses a Deep Neural Network Regressor to make predictions. The general recommendation is to normalize the inputs before feeding them into the ML model. The pipeline uses tf.Transform to normalize the counts of each atom to values between 0 and 1. To read more about normalizing inputs, see feature scaling.
Normalizing the values requires a full pass through the dataset, recording the minimum and maximum values. The code uses tf.Transform to go through the entire dataset and apply full-pass transforms.
To use tf.Transform, the code must provide a function that contains the logic of the transform to perform on the dataset. In preprocess.py, the code uses the AnalyzeAndTransformDataset transform provided by tf.Transform. Learn more about how to use tf.Transform.
# Apply the tf.Transform preprocessing_fn
input_metadata = dataset_metadata.DatasetMetadata(
dataset_schema.from_feature_spec(input_feature_spec))
dataset_and_metadata, transform_fn = (
(dataset, input_metadata)
| 'Feature scaling' >> beam_impl.AnalyzeAndTransformDataset(
feature_scaling))
dataset, metadata = dataset_and_metadata
In preprocess.py, the feature_scaling function used is normalize_inputs, which is defined in pubchem/pipeline.py. The function uses the tf.Transform function scale_to_0_1 to normalize the counts to values between 0 and 1.
def normalize_inputs(inputs):
"""Preprocessing function for tf.Transform (full-pass transformations).
Here we will do any preprocessing that requires a full-pass of the dataset.
It takes as inputs the preprocessed data from the `PTransform` we specify, in
this case `SimpleFeatureExtraction`.
Common operations might be scaling values to 0-1, getting the minimum or
maximum value of a certain field, creating a vocabulary for a string field.
There are two main types of transformations supported by tf.Transform, for
more information, check the following modules:
- analyzers: tensorflow_transform.analyzers.py
- mappers: tensorflow_transform.mappers.py
Any transformation done in tf.Transform will be embedded into the TensorFlow
model itself.
"""
return {
# Scale the input features for normalization
'NormalizedC': tft.scale_to_0_1(inputs['TotalC']),
'NormalizedH': tft.scale_to_0_1(inputs['TotalH']),
'NormalizedO': tft.scale_to_0_1(inputs['TotalO']),
'NormalizedN': tft.scale_to_0_1(inputs['TotalN']),
# Do not scale the label since we want the absolute number for prediction
'Energy': inputs['Energy'],
}
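Conceptually, scale_to_0_1 applies min-max scaling, (x - min) / (max - min), where the minimum and maximum are computed over the entire dataset during the analyze phase. A quick plain-Python illustration (not tf.Transform itself) with hypothetical carbon counts:

# Plain-Python illustration of min-max scaling; tf.Transform computes the min and
# max with a full pass over the dataset before applying the same mapping.
total_c = [7, 3, 12, 5]  # hypothetical carbon counts
lo, hi = min(total_c), max(total_c)
print([(x - lo) / (hi - lo) for x in total_c])  # [0.444..., 0.0, 1.0, 0.222...]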
Next, the preprocess.py pipeline partitions the single dataset into two datasets. It allocates approximately 80% of the data to be used as training data, and approximately 20% of the data to be used as evaluation data.
# Split the dataset into a training set and an evaluation set
assert 0 < eval_percent < 100, 'eval_percent must in the range (0-100)'
train_dataset, eval_dataset = (
dataset
| 'Split dataset' >> beam.Partition(
lambda elem, _: int(random.uniform(0, 100) < eval_percent), 2))
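To see how the partition function behaves, note that the lambda returns 1 with probability eval_percent/100 (sending that element to the evaluation partition) and 0 otherwise. Here is a small standalone sketch with a hypothetical 20% evaluation split:

import random
import apache_beam as beam

eval_percent = 20.0  # hypothetical split matching the ~80/20 described above

with beam.Pipeline() as p:
    train, evaluation = (
        p
        | 'Create toy elements' >> beam.Create(list(range(20)))
        | 'Split dataset' >> beam.Partition(
            lambda elem, _: int(random.uniform(0, 100) < eval_percent), 2))
    _ = train | 'Print train' >> beam.Map(lambda x: print('train', x))
    _ = evaluation | 'Print eval' >> beam.Map(lambda x: print('eval', x))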
Finally, the preprocess.py pipeline writes the two datasets (training and evaluation) using the WriteToTFRecord transform.
# Write the datasets as TFRecords
coder = example_proto_coder.ExampleProtoCoder(metadata.schema)
train_dataset_prefix = os.path.join(train_dataset_dir, 'part')
_ = (
train_dataset
| 'Write train dataset' >> tfrecordio.WriteToTFRecord(
train_dataset_prefix, coder))
eval_dataset_prefix = os.path.join(eval_dataset_dir, 'part')
_ = (
eval_dataset
| 'Write eval dataset' >> tfrecordio.WriteToTFRecord(
eval_dataset_prefix, coder))
# Write the transform_fn
_ = (
transform_fn
| 'Write transformFn' >> transform_fn_io.WriteTransformFn(work_dir))
Recall that at the end of the preprocessing phase, the code split the data into two datasets (training and evaluation).
The script uses a simple dense neural network for the regression problem. The trainer/task.py file contains the code for training the model. The main function of trainer/task.py loads the parameters needed from the preprocessing phase and passes them to the task runner function (i.e. run_fn).
In this exercise, we will not focus too much on the training metrics (e.g. accuracy). That is discussed in other courses of this specialization. The main objective is to look at the outputs and how they are connected to the prediction phase.
# Print help documentation
!python ./molecules/trainer/task.py --help
# Run the trainer.
!python ./molecules/trainer/task.py --work-dir {WORK_DIR}
The outputs of this phase are in the model directory. This will be the trained model that you will use for predictions.
!ls {WORK_DIR}/model
The important thing to note in the training script is that it also exports the transformation graph with the model. That is shown in these lines:
# Define default serving signature
signatures = {
'serving_default':
_get_serve_tf_examples_fn(model,
tf_transform_output, input_feature_spec).get_concrete_function(
[signatures_dict])
}
# Save model with signature
model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures, include_optimizer=False)
The implementation of _get_serve_tf_examples_fn() is as follows:
def _get_serve_tf_examples_fn(model, tf_transform_output, feature_spec):
"""Returns a function that applies data transformation and generates predictions"""
# Get transformation graph
model.tft_layer = tf_transform_output.transform_features_layer()
@tf.function
def serve_tf_examples_fn(inputs_list):
"""Returns the output to be used in the serving signature."""
# Create a shallow copy of the dictionary in the single element list
inputs = inputs_list[0].copy()
# Pop ID since it is not needed in the transformation graph
# Also needed to identify predictions
id_key = inputs.pop('ID')
# Apply data transformation to the raw inputs
transformed = model.tft_layer(inputs)
# Pass the transformed data to the model to get predictions
predictions = model(transformed.values())
return id_key, predictions
return serve_tf_examples_fn
The use of model.tft_layer means that your model can accept raw data and will apply the transformation before feeding it to the model to make predictions. This implies that when you serve your model for predictions, you don’t have to worry about creating a pipeline to transform new data coming in. The model will already do that for you through this serving input function. It helps prevent training-serving skew since you’re handling the training and serving data the same way.
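If you want to check what the exported serving signature expects, you can load the saved model and inspect it. This is just a quick sketch, assuming the results/model path produced by the training step in this lab:

import tensorflow as tf

# Assumed export path from the training step above
loaded = tf.saved_model.load('results/model')
serving_fn = loaded.signatures['serving_default']
print(serving_fn.structured_input_signature)  # inputs the signature accepts
print(serving_fn.structured_outputs)          # outputs it returns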
After training the model, you can provide the model with inputs and it will make predictions. The pipeline in predict.py is responsible for making predictions. It reads the input files from the custom source and writes the output predictions as text files to the specified working directory.
if args.verb == 'batch':
data_files_pattern = os.path.join(args.inputs_dir, '*.sdf')
results_prefix = os.path.join(args.outputs_dir, 'part')
source = pubchem.ParseSDF(data_files_pattern)
sink = beam.io.WriteToText(results_prefix)
The following image shows the steps in the prediction pipeline:
In predict.py, the code defines the pipeline in the run function:
def run(model_dir, feature_extraction, sink, beam_options=None):
print('Listening...')
with beam.Pipeline(options=beam_options) as p:
_ = (p
| 'Feature extraction' >> feature_extraction
| 'Predict' >> beam.ParDo(Predict(model_dir, 'ID'))
| 'Format as JSON' >> beam.Map(json.dumps)
| 'Write predictions' >> sink)
The code calls the run function with the following parameters:
run(
args.model_dir,
pubchem.SimpleFeatureExtraction(source),
sink,
beam_options)
First, the code passes the pubchem.SimpleFeatureExtraction(source) transform as the feature_extraction transform. This transform, which was also used in the preprocessing phase, is applied to the pipeline:
class SimpleFeatureExtraction(beam.PTransform):
"""The feature extraction (element-wise transformations).
We create a `PTransform` class. This `PTransform` is a bundle of
transformations that can be applied to any other pipeline as a step.
We'll extract all the raw features here. Due to the nature of `PTransform`s,
we can only do element-wise transformations here. Anything that requires a
full-pass of the data (such as feature scaling) has to be done with
tf.Transform.
"""
def __init__(self, source):
super(SimpleFeatureExtraction, self).__init__()
self.source = source
def expand(self, p):
# Return the preprocessing pipeline. In this case we're reading the PubChem
# files, but the source could be any Apache Beam source.
return (p
| 'Read raw molecules' >> self.source
| 'Format molecule' >> beam.ParDo(FormatMolecule())
| 'Count atoms' >> beam.ParDo(CountAtoms())
)
The transform reads from the appropriate source based on the pipeline’s execution mode (i.e. batch), formats the molecules, and counts the different atoms in each molecule.
Next, beam.ParDo(Predict(…)) is applied to the pipeline to perform the prediction of the molecular energy. Predict, the DoFn that’s passed, uses the given dictionary of input features (atom counts) to predict the molecular energy.
The next transform applied to the pipeline is beam.Map(json.dumps), which takes the prediction result dictionary and serializes it into a JSON string. Finally, the output is written to the sink.
Batch predictions are optimized for throughput rather than latency. They work best if you’re making many predictions and you can wait for all of them to finish before getting the results. You can run the following cells to use the script for batch predictions. For simplicity, you will use the same file you used for training. If you want, however, you can use the data extractor script from earlier to grab a different SDF file and feed it here.
# Print help documentation. You can ignore references to GCP and streaming data.
!python ./molecules/predict.py --help
# Define model, input and output data directories
MODEL_DIR = f'{WORK_DIR}/model'
DATA_DIR = f'{WORK_DIR}/data'
PRED_DIR = f'{WORK_DIR}/predictions'
# Run batch prediction. This can take up to 7 minutes.
!python ./molecules/predict.py \
--model-dir {MODEL_DIR} \
--work-dir {WORK_DIR} \
batch \
--inputs-dir {DATA_DIR} \
--outputs-dir {PRED_DIR}
The results should now be in the predictions folder. This is just a text file, so you can easily print the output.
# List working directory
!ls {WORK_DIR}
# Print the first 100 results
!head -n 100 /content/results/predictions/part-00000-of-00001
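Since each line is produced by json.dumps, you can also load the results back into Python. Here is a small sketch; the exact structure of each result depends on what the Predict DoFn yields:

import itertools
import json

# Assumed output path based on the PRED_DIR used above
pred_file = 'results/predictions/part-00000-of-00001'

with open(pred_file) as f:
    for line in itertools.islice(f, 5):
        print(json.loads(line))  # each line is one JSON-serialized prediction result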
You’ve now completed all phases of the Beam-based pipeline! Similar processes are done under the hood by higher-level frameworks such as TFX, and you can use the techniques here to understand their codebases better or to extend them for your own needs. As mentioned earlier, the original article also offers the option to use GCP and to perform online predictions. Feel free to try it out, but be aware of the recurring costs.
On to the next part of the course!