In this lab, you will create, train, and evaluate a model, and then use it to make predictions, using Apache Beam and TensorFlow. In particular, you will train a model to predict the molecular energy based on the number of carbon, hydrogen, oxygen, and nitrogen atoms.
This lab is marked as optional because you will not be interacting with Beam-based systems directly in future exercises. Other courses of this specialization also use tools that abstract this layer. Nonetheless, it is good to be familiar with Beam since it is used under the hood by TFX, the main ML pipelines framework that you will use in other labs. Seeing how these systems work will let you explore other codebases that use this tool more freely, and even make contributions or bug fixes as you see fit. If you don’t know the basics of Beam yet, we encourage you to look at the Minimal Word Count example here for a quick start and use the Beam Programming Guide to look up concepts if needed.
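If you want a quick feel for the Beam primitives before diving in, here is a minimal, self-contained sketch (not part of the lab scripts) showing a Pipeline, an in-memory PCollection, and a couple of element-wise transforms. It mirrors the Minimal Word Count pattern, just counting atom symbols instead of words:

import apache_beam as beam

# Toy pipeline run locally with the default (Direct) runner
with beam.Pipeline() as p:
    _ = (p
         | 'Create atoms' >> beam.Create(['C', 'H', 'H', 'O', 'H', 'H'])
         | 'Pair with one' >> beam.Map(lambda atom: (atom, 1))
         | 'Count per atom' >> beam.CombinePerKey(sum)
         | 'Print results' >> beam.Map(print))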
The entire pipeline can be divided into four phases: (1) data extraction, (2) preprocessing, (3) training, and (4) predictions.
You will focus particularly on Phase 2 (Preprocessing) and a bit of Phase 4 (Predictions) because these use Beam in their implementation.
Let’s begin!
Note: This tutorial uses code, images, and discussion from this article. We highlighted a few key parts and updated some of the code to use more recent versions. Also, we focused on making the lab run locally. The original article linked above contains instructions on running it in GCP. Just take note that it will have associated costs depending on the resources you use.
You will first set up the environment and download the scripts that you will use in the lab.
# NOTE: Some of the packages used in this lab are not compatible with the
# default Python version in Colab (Python3.8 as of January 2023).
# The commands below will setup the environment to use Python3.7 instead.
# Please *DO NOT* restart the runtime after running these.
# Install packages needed to downgrade to Python3.7
!apt-get install python3.7 python3.7-distutils python3-pip
# Configure the Colab environment to use Python3.7 by default
!update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 100
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
libpython3.7-minimal libpython3.7-stdlib mailcap mime-support
python3-setuptools python3-wheel python3.7-lib2to3 python3.7-minimal
Suggested packages:
python-setuptools-doc python3.7-venv binfmt-support
The following NEW packages will be installed:
libpython3.7-minimal libpython3.7-stdlib mailcap mime-support python3-pip
python3-setuptools python3-wheel python3.7 python3.7-distutils
python3.7-lib2to3 python3.7-minimal
0 upgraded, 11 newly installed, 0 to remove and 16 not upgraded.
Need to get 6,688 kB of archives.
After this operation, 28.0 MB of additional disk space will be used.
Get:1 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 libpython3.7-minimal amd64 3.7.17-1+jammy1 [608 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 mailcap all 3.70+nmu1ubuntu1 [23.8 kB]
Get:3 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 python3.7-minimal amd64 3.7.17-1+jammy1 [1,837 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/main amd64 mime-support all 3.66 [3,696 B]
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3-setuptools all 59.6.0-1.2ubuntu0.22.04.1 [339 kB]
Get:6 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 libpython3.7-stdlib amd64 3.7.17-1+jammy1 [1,864 kB]
Get:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 python3.7 amd64 3.7.17-1+jammy1 [362 kB]
Get:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 python3.7-lib2to3 all 3.7.17-1+jammy1 [124 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-wheel all 0.37.1-2ubuntu0.22.04.1 [32.0 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-pip all 22.0.2+dfsg-1ubuntu0.3 [1,305 kB]
Get:11 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 python3.7-distutils all 3.7.17-1+jammy1 [189 kB]
Fetched 6,688 kB in 1s (7,477 kB/s)
Selecting previously unselected package libpython3.7-minimal:amd64.
(Reading database ... 120831 files and directories currently installed.)
Preparing to unpack .../00-libpython3.7-minimal_3.7.17-1+jammy1_amd64.deb ...
Unpacking libpython3.7-minimal:amd64 (3.7.17-1+jammy1) ...
Selecting previously unselected package python3.7-minimal.
Preparing to unpack .../01-python3.7-minimal_3.7.17-1+jammy1_amd64.deb ...
Unpacking python3.7-minimal (3.7.17-1+jammy1) ...
Selecting previously unselected package mailcap.
Preparing to unpack .../02-mailcap_3.70+nmu1ubuntu1_all.deb ...
Unpacking mailcap (3.70+nmu1ubuntu1) ...
Selecting previously unselected package mime-support.
Preparing to unpack .../03-mime-support_3.66_all.deb ...
Unpacking mime-support (3.66) ...
Selecting previously unselected package libpython3.7-stdlib:amd64.
Preparing to unpack .../04-libpython3.7-stdlib_3.7.17-1+jammy1_amd64.deb ...
Unpacking libpython3.7-stdlib:amd64 (3.7.17-1+jammy1) ...
Selecting previously unselected package python3-setuptools.
Preparing to unpack .../05-python3-setuptools_59.6.0-1.2ubuntu0.22.04.1_all.deb ...
Unpacking python3-setuptools (59.6.0-1.2ubuntu0.22.04.1) ...
Selecting previously unselected package python3-wheel.
Preparing to unpack .../06-python3-wheel_0.37.1-2ubuntu0.22.04.1_all.deb ...
Unpacking python3-wheel (0.37.1-2ubuntu0.22.04.1) ...
Selecting previously unselected package python3-pip.
Preparing to unpack .../07-python3-pip_22.0.2+dfsg-1ubuntu0.3_all.deb ...
Unpacking python3-pip (22.0.2+dfsg-1ubuntu0.3) ...
Selecting previously unselected package python3.7.
Preparing to unpack .../08-python3.7_3.7.17-1+jammy1_amd64.deb ...
Unpacking python3.7 (3.7.17-1+jammy1) ...
Selecting previously unselected package python3.7-lib2to3.
Preparing to unpack .../09-python3.7-lib2to3_3.7.17-1+jammy1_all.deb ...
Unpacking python3.7-lib2to3 (3.7.17-1+jammy1) ...
Selecting previously unselected package python3.7-distutils.
Preparing to unpack .../10-python3.7-distutils_3.7.17-1+jammy1_all.deb ...
Unpacking python3.7-distutils (3.7.17-1+jammy1) ...
Setting up python3-setuptools (59.6.0-1.2ubuntu0.22.04.1) ...
Setting up libpython3.7-minimal:amd64 (3.7.17-1+jammy1) ...
Setting up python3-wheel (0.37.1-2ubuntu0.22.04.1) ...
Setting up python3-pip (22.0.2+dfsg-1ubuntu0.3) ...
Setting up python3.7-minimal (3.7.17-1+jammy1) ...
Setting up python3.7-lib2to3 (3.7.17-1+jammy1) ...
Setting up mailcap (3.70+nmu1ubuntu1) ...
Setting up mime-support (3.66) ...
Setting up python3.7-distutils (3.7.17-1+jammy1) ...
Setting up libpython3.7-stdlib:amd64 (3.7.17-1+jammy1) ...
Setting up python3.7 (3.7.17-1+jammy1) ...
Processing triggers for man-db (2.10.2-1) ...
update-alternatives: using /usr/bin/python3.7 to provide /usr/bin/python3 (python3) in auto mode
# Download the scripts
!wget https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/raw/main/course4/week2-ungraded-labs/C4_W2_Lab_4_ETL_Beam/data/molecules.tar.gz
# Unzip the archive
!tar -xvzf molecules.tar.gz
--2023-08-30 04:46:22-- https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/raw/main/course4/week2-ungraded-labs/C4_W2_Lab_4_ETL_Beam/data/molecules.tar.gz
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/main/course4/week2-ungraded-labs/C4_W2_Lab_4_ETL_Beam/data/molecules.tar.gz [following]
--2023-08-30 04:46:23-- https://raw.githubusercontent.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/main/course4/week2-ungraded-labs/C4_W2_Lab_4_ETL_Beam/data/molecules.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10603 (10K) [application/octet-stream]
Saving to: ‘molecules.tar.gz’
molecules.tar.gz 100%[===================>] 10.35K --.-KB/s in 0s
2023-08-30 04:46:23 (99.0 MB/s) - ‘molecules.tar.gz’ saved [10603/10603]
data-extractor.py
predict.py
preprocess.py
pubchem/
pubchem/__init__.py
pubchem/sdf.py
pubchem/pipeline.py
trainer/
trainer/task.py
trainer/__init__.py
The molecules directory you downloaded mainly contains 4 scripts that encapsulate all phases of the workflow you will execute in this lab. It is summarized by the figure below:
It also includes the pubchem subdirectory which contains common modules (i.e. pipeline.py and sdf.py) shared by the preprocessing and prediction phases. If you look at preprocess.py and predict.py, you can see the line import pubchem at the top.
You will then install some packages needed in this notebook. You will use the Apache Beam version bundled with TensorFlow Transform. You can safely ignore the incompatibility error shown after installing dill.
# IMPORTANT NOTE: Please DO NOT restart the runtime after running the commands
# below. Doing so might cause issues because the Python runtime was switched.
# Install packages
!pip install tensorflow-transform==1.12
!pip install dill==0.3.3
Collecting tensorflow-transform==1.12
Downloading tensorflow_transform-1.12.0-py3-none-any.whl (439 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m439.8/439.8 KB[0m [31m5.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pyarrow<7,>=6
Downloading pyarrow-6.0.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.6 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m25.6/25.6 MB[0m [31m49.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting apache-beam[gcp]<3,>=2.41
Downloading apache_beam-2.48.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.5 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.5/13.5 MB[0m [31m87.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting absl-py<2.0.0,>=0.9
Downloading absl_py-1.4.0-py3-none-any.whl (126 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m126.5/126.5 KB[0m [31m16.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting numpy<2,>=1.16
Downloading numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m15.7/15.7 MB[0m [31m43.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tensorflow-metadata<1.13.0,>=1.12.0
Downloading tensorflow_metadata-1.12.0-py3-none-any.whl (52 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m52.3/52.3 KB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pydot<2,>=1.2
Downloading pydot-1.4.2-py2.py3-none-any.whl (21 kB)
Collecting tensorflow<2.12,>=2.11.0
Downloading tensorflow-2.11.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (588.3 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m588.3/588.3 MB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting protobuf<4,>=3.13
Downloading protobuf-3.20.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m69.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tfx-bsl<1.13.0,>=1.12.0
Downloading tfx_bsl-1.12.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (21.6 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m21.6/21.6 MB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting objsize<0.7.0,>=0.6.1
Downloading objsize-0.6.1-py3-none-any.whl (9.3 kB)
Collecting python-dateutil<3,>=2.8.0
Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m247.7/247.7 KB[0m [31m26.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pymongo<5.0.0,>=3.8.0
Downloading pymongo-4.5.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (653 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m653.7/653.7 KB[0m [31m54.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting crcmod<2.0,>=1.7
Downloading crcmod-1.7.tar.gz (89 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m89.7/89.7 KB[0m [31m12.2 MB/s[0m eta [36m0:00:00[0m
[?25h Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting dill<0.3.2,>=0.3.1.1
Downloading dill-0.3.1.1.tar.gz (151 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m152.0/152.0 KB[0m [31m17.8 MB/s[0m eta [36m0:00:00[0m
[?25h Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting zstandard<1,>=0.18.0
Downloading zstandard-0.21.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.7 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.7/2.7 MB[0m [31m74.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting orjson<4.0
Downloading orjson-3.9.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (139 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m139.7/139.7 KB[0m [31m16.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pytz>=2018.3
Downloading pytz-2023.3-py2.py3-none-any.whl (502 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m502.3/502.3 KB[0m [31m33.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting httplib2<0.23.0,>=0.8
Downloading httplib2-0.22.0-py3-none-any.whl (96 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m96.9/96.9 KB[0m [31m12.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting proto-plus<2,>=1.7.1
Downloading proto_plus-1.22.3-py3-none-any.whl (48 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.1/48.1 KB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting hdfs<3.0.0,>=2.1.0
Downloading hdfs-2.7.2.tar.gz (43 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.4/43.4 KB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[?25h Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting regex>=2020.6.8
Downloading regex-2023.8.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (758 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m759.0/759.0 KB[0m [31m62.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting cloudpickle~=2.2.1
Downloading cloudpickle-2.2.1-py3-none-any.whl (25 kB)
Collecting fastavro<2,>=0.23.6
Downloading fastavro-1.8.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m92.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting typing-extensions>=3.7.0
Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting fasteners<1.0,>=0.3
Downloading fasteners-0.18-py3-none-any.whl (18 kB)
Collecting requests<3.0.0,>=2.24.0
Downloading requests-2.31.0-py3-none-any.whl (62 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.6/62.6 KB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting grpcio!=1.48.0,<2,>=1.33.1
Downloading grpcio-1.57.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.3 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.3/5.3 MB[0m [31m102.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-recommendations-ai<0.11.0,>=0.1.0
Downloading google_cloud_recommendations_ai-0.10.4-py2.py3-none-any.whl (173 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m173.3/173.3 KB[0m [31m22.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-bigquery<4,>=2.0.0
Downloading google_cloud_bigquery-3.11.4-py2.py3-none-any.whl (219 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m219.6/219.6 KB[0m [31m27.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-vision<4,>=2
Downloading google_cloud_vision-3.4.4-py2.py3-none-any.whl (444 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m444.0/444.0 KB[0m [31m43.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-language<3,>=2.0
Downloading google_cloud_language-2.11.0-py2.py3-none-any.whl (138 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m138.7/138.7 KB[0m [31m18.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-dlp<4,>=3.0.0
Downloading google_cloud_dlp-3.12.2-py2.py3-none-any.whl (143 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.4/143.4 KB[0m [31m16.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-bigtable<2.18.0,>=2.0.0
Downloading google_cloud_bigtable-2.17.0-py2.py3-none-any.whl (288 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m288.6/288.6 KB[0m [31m31.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-core<3,>=2.0.0
Downloading google_cloud_core-2.3.3-py2.py3-none-any.whl (29 kB)
Collecting google-cloud-pubsublite<2,>=1.2.0
Downloading google_cloud_pubsublite-1.7.0-py2.py3-none-any.whl (273 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m273.9/273.9 KB[0m [31m31.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting cachetools<6,>=3.1.0
Downloading cachetools-5.3.1-py3-none-any.whl (9.3 kB)
Collecting google-auth-httplib2<0.2.0,>=0.1.0
Downloading google_auth_httplib2-0.1.0-py2.py3-none-any.whl (9.3 kB)
Collecting google-cloud-bigquery-storage<3,>=2.6.3
Downloading google_cloud_bigquery_storage-2.22.0-py2.py3-none-any.whl (190 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.9/190.9 KB[0m [31m21.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-pubsub<3,>=2.1.0
Downloading google_cloud_pubsub-2.18.3-py2.py3-none-any.whl (265 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m265.9/265.9 KB[0m [31m30.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-apitools<0.5.32,>=0.5.31
Downloading google-apitools-0.5.31.tar.gz (173 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m173.5/173.5 KB[0m [31m24.5 MB/s[0m eta [36m0:00:00[0m
[?25h Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting google-cloud-datastore<3,>=2.0.0
Downloading google_cloud_datastore-2.17.0-py2.py3-none-any.whl (177 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m177.1/177.1 KB[0m [31m20.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-spanner<4,>=3.0.0
Downloading google_cloud_spanner-3.40.1-py2.py3-none-any.whl (332 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m332.9/332.9 KB[0m [31m38.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-auth<3,>=1.18.0
Downloading google_auth-2.22.0-py2.py3-none-any.whl (181 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m181.8/181.8 KB[0m [31m24.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-videointelligence<3,>=2.0
Downloading google_cloud_videointelligence-2.11.3-py2.py3-none-any.whl (229 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m229.4/229.4 KB[0m [31m29.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pyparsing>=2.1.4
Downloading pyparsing-3.1.1-py3-none-any.whl (103 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m103.1/103.1 KB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting packaging
Downloading packaging-23.1-py3-none-any.whl (48 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.9/48.9 KB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[?25hRequirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from tensorflow<2.12,>=2.11.0->tensorflow-transform==1.12) (59.6.0)
Collecting opt-einsum>=2.3.2
Downloading opt_einsum-3.3.0-py3-none-any.whl (65 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m65.5/65.5 KB[0m [31m669.8 kB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-pasta>=0.1.1
Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m57.5/57.5 KB[0m [31m7.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting termcolor>=1.1.0
Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Collecting tensorboard<2.12,>=2.11
Downloading tensorboard-2.11.2-py3-none-any.whl (6.0 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.0/6.0 MB[0m [31m106.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting astunparse>=1.6.0
Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting tensorflow-io-gcs-filesystem>=0.23.1
Downloading tensorflow_io_gcs_filesystem-0.33.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.4 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m74.1 MB/s[0m eta [36m0:00:00[0m
[?25hINFO: pip is looking at multiple versions of pydot to determine which version is compatible with other requirements. This could take a while.
Collecting pydot<2,>=1.2
Downloading pydot-1.4.1-py2.py3-none-any.whl (19 kB)
INFO: pip is looking at multiple versions of pyarrow to determine which version is compatible with other requirements. This could take a while.
Collecting pyarrow<7,>=6
Downloading pyarrow-6.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.5 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m25.5/25.5 MB[0m [31m16.3 MB/s[0m eta [36m0:00:00[0m
[?25hINFO: pip is looking at multiple versions of protobuf to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of numpy to determine which version is compatible with other requirements. This could take a while.
Collecting numpy<2,>=1.16
Downloading numpy-1.21.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m15.7/15.7 MB[0m [31m83.5 MB/s[0m eta [36m0:00:00[0m
[?25hINFO: pip is looking at multiple versions of apache-beam[gcp] to determine which version is compatible with other requirements. This could take a while.
Collecting apache-beam[gcp]<3,>=2.41
Downloading apache_beam-2.47.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.5 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.5/13.5 MB[0m [31m14.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting httplib2<0.22.0,>=0.8
Downloading httplib2-0.21.0-py3-none-any.whl (96 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m96.8/96.8 KB[0m [31m13.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting cachetools<5,>=3.1.0
Downloading cachetools-4.2.4-py3-none-any.whl (10 kB)
Collecting google-cloud-bigtable<3,>=2.0.0
Downloading google_cloud_bigtable-2.21.0-py2.py3-none-any.whl (293 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m293.0/293.0 KB[0m [31m27.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting apache-beam[gcp]<3,>=2.41
Downloading apache_beam-2.46.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.4/13.4 MB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pymongo<4.0.0,>=3.8.0
Downloading pymongo-3.13.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (506 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m506.0/506.0 KB[0m [31m44.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-videointelligence<2,>=1.8.0
Downloading google_cloud_videointelligence-1.16.3-py2.py3-none-any.whl (183 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m183.9/183.9 KB[0m [31m24.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-language<2,>=1.3.0
Downloading google_cloud_language-1.3.2-py2.py3-none-any.whl (83 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m83.6/83.6 KB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-recommendations-ai<0.8.0,>=0.1.0
Downloading google_cloud_recommendations_ai-0.7.1-py2.py3-none-any.whl (148 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m148.2/148.2 KB[0m [31m17.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-bigtable<2,>=0.31.1
Downloading google_cloud_bigtable-1.7.3-py2.py3-none-any.whl (268 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.7/268.7 KB[0m [31m33.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-bigquery-storage<2.17,>=2.6.3
Downloading google_cloud_bigquery_storage-2.16.2-py2.py3-none-any.whl (185 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m185.4/185.4 KB[0m [31m25.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-cloud-datastore<2,>=1.8.0
Downloading google_cloud_datastore-1.15.5-py2.py3-none-any.whl (134 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.2/134.2 KB[0m [31m20.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting protobuf<4,>=3.13
Downloading protobuf-3.19.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m77.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting flatbuffers>=2.0
Downloading flatbuffers-23.5.26-py2.py3-none-any.whl (26 kB)
Collecting keras<2.12,>=2.11.0
Downloading keras-2.11.0-py2.py3-none-any.whl (1.7 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m93.2 MB/s[0m eta [36m0:00:00[0m
[?25hRequirement already satisfied: six>=1.12.0 in /usr/lib/python3/dist-packages (from tensorflow<2.12,>=2.11.0->tensorflow-transform==1.12) (1.16.0)
Collecting gast<=0.4.0,>=0.2.1
Downloading gast-0.4.0-py3-none-any.whl (9.8 kB)
Collecting h5py>=2.9.0
Downloading h5py-3.8.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.3 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.3/4.3 MB[0m [31m108.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tensorflow-estimator<2.12,>=2.11.0
Downloading tensorflow_estimator-2.11.0-py2.py3-none-any.whl (439 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m439.2/439.2 KB[0m [31m45.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting libclang>=13.0.0
Downloading libclang-16.0.6-py2.py3-none-manylinux2010_x86_64.whl (22.9 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m22.9/22.9 MB[0m [31m74.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting wrapt>=1.11.0
Downloading wrapt-1.15.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (75 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.7/75.7 KB[0m [31m9.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting googleapis-common-protos<2,>=1.52.0
Downloading googleapis_common_protos-1.60.0-py2.py3-none-any.whl (227 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m227.6/227.6 KB[0m [31m24.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pandas<2,>=1.0
Downloading pandas-1.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.3/11.3 MB[0m [31m109.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tensorflow-serving-api!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,<3,>=1.15
Downloading tensorflow_serving_api-2.11.1-py2.py3-none-any.whl (37 kB)
Collecting google-api-python-client<2,>=1.7.11
Downloading google_api_python_client-1.12.11-py2.py3-none-any.whl (62 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.1/62.1 KB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
[?25hRequirement already satisfied: wheel<1.0,>=0.23.0 in /usr/lib/python3/dist-packages (from astunparse>=1.6.0->tensorflow<2.12,>=2.11.0->tensorflow-transform==1.12) (0.37.1)
Collecting google-api-core<3dev,>=1.21.0
Downloading google_api_core-2.11.1-py3-none-any.whl (120 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m120.5/120.5 KB[0m [31m16.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting uritemplate<4dev,>=3.0.0
Downloading uritemplate-3.0.1-py2.py3-none-any.whl (15 kB)
Collecting oauth2client>=1.4.12
Downloading oauth2client-4.1.3-py2.py3-none-any.whl (98 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.2/98.2 KB[0m [31m13.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pyasn1-modules>=0.2.1
Downloading pyasn1_modules-0.3.0-py2.py3-none-any.whl (181 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m181.3/181.3 KB[0m [31m24.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting urllib3<2.0
Downloading urllib3-1.26.16-py2.py3-none-any.whl (143 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.1/143.1 KB[0m [31m19.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting rsa<5,>=3.1.4
Downloading rsa-4.9-py3-none-any.whl (34 kB)
Collecting google-resumable-media<3.0dev,>=0.6.0
Downloading google_resumable_media-2.5.0-py2.py3-none-any.whl (77 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.7/77.7 KB[0m [31m10.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting grpc-google-iam-v1<0.13dev,>=0.12.3
Downloading grpc_google_iam_v1-0.12.6-py2.py3-none-any.whl (26 kB)
Collecting grpcio-status>=1.33.2
Downloading grpcio_status-1.57.0-py3-none-any.whl (5.1 kB)
Collecting overrides<7.0.0,>=6.0.1
Downloading overrides-6.5.0-py3-none-any.whl (17 kB)
Collecting sqlparse>=0.4.4
Downloading sqlparse-0.4.4-py3-none-any.whl (41 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m41.2/41.2 KB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting docopt
Downloading docopt-0.6.2.tar.gz (25 kB)
Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting charset-normalizer<4,>=2
Downloading charset_normalizer-3.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (175 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m175.8/175.8 KB[0m [31m23.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting certifi>=2017.4.17
Downloading certifi-2023.7.22-py3-none-any.whl (158 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m158.3/158.3 KB[0m [31m20.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting idna<4,>=2.5
Downloading idna-3.4-py3-none-any.whl (61 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.5/61.5 KB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tensorboard-data-server<0.7.0,>=0.6.0
Downloading tensorboard_data_server-0.6.1-py3-none-manylinux2010_x86_64.whl (4.9 MB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.9/4.9 MB[0m [31m108.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting markdown>=2.6.8
Downloading Markdown-3.4.4-py3-none-any.whl (94 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m94.2/94.2 KB[0m [31m12.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting werkzeug>=1.0.1
Downloading Werkzeug-2.2.3-py3-none-any.whl (233 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m233.6/233.6 KB[0m [31m27.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting google-auth-oauthlib<0.5,>=0.4.1
Downloading google_auth_oauthlib-0.4.6-py2.py3-none-any.whl (18 kB)
Collecting tensorboard-plugin-wit>=1.6.0
Downloading tensorboard_plugin_wit-1.8.1-py3-none-any.whl (781 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m781.3/781.3 KB[0m [31m61.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tensorflow-serving-api!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,<3,>=1.15
Downloading tensorflow_serving_api-2.11.0-py2.py3-none-any.whl (37 kB)
Collecting requests-oauthlib>=0.7.0
Downloading requests_oauthlib-1.3.1-py2.py3-none-any.whl (23 kB)
Collecting google-crc32c<2.0dev,>=1.0
Downloading google_crc32c-1.5.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (32 kB)
Collecting grpcio-status>=1.33.2
Downloading grpcio_status-1.56.2-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.56.0-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.55.3-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.54.3-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.54.2-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.54.0-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.53.2-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.53.1-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.53.0-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.51.3-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.51.1-py3-none-any.whl (5.1 kB)
Downloading grpcio_status-1.50.0-py3-none-any.whl (14 kB)
Downloading grpcio_status-1.49.1-py3-none-any.whl (14 kB)
Downloading grpcio_status-1.48.2-py3-none-any.whl (14 kB)
Collecting importlib-metadata>=4.4
Downloading importlib_metadata-6.7.0-py3-none-any.whl (22 kB)
Collecting pyasn1>=0.1.7
Downloading pyasn1-0.5.0-py2.py3-none-any.whl (83 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m83.9/83.9 KB[0m [31m12.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting MarkupSafe>=2.1.1
Downloading MarkupSafe-2.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Collecting zipp>=0.5
Downloading zipp-3.15.0-py3-none-any.whl (6.8 kB)
Collecting oauthlib>=3.0.0
Downloading oauthlib-3.2.2-py3-none-any.whl (151 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m151.7/151.7 KB[0m [31m21.1 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: crcmod, dill, google-apitools, hdfs, docopt
Building wheel for crcmod (setup.py) ... [?25l[?25hdone
Created wheel for crcmod: filename=crcmod-1.7-py3-none-any.whl size=18850 sha256=8378f38ffd63e79b51bb7ecd940dad615dc9a2b4cc121bf0f5eb1eaba61dd8db
Stored in directory: /root/.cache/pip/wheels/dc/9a/e9/49e627353476cec8484343c4ab656f1e0d783ee77b9dde2d1f
Building wheel for dill (setup.py) ... [?25l[?25hdone
Created wheel for dill: filename=dill-0.3.1.1-py3-none-any.whl size=78544 sha256=66e33f528bc520a52c1bfbc624a413a62edc87eb8fcdcea68908a5b1fbd85a2d
Stored in directory: /root/.cache/pip/wheels/a4/61/fd/c57e374e580aa78a45ed78d5859b3a44436af17e22ca53284f
Building wheel for google-apitools (setup.py) ... [?25l[?25hdone
Created wheel for google-apitools: filename=google_apitools-0.5.31-py3-none-any.whl size=131040 sha256=3b45a2ad6ddaf5dd0b99ca1887f1b6848069524b49144d71c364d7d7caedeaab
Stored in directory: /root/.cache/pip/wheels/19/b5/2f/1cc3cf2b31e7a9cd1508731212526d9550271274d351c96f16
Building wheel for hdfs (setup.py) ... [?25l[?25hdone
Created wheel for hdfs: filename=hdfs-2.7.2-py3-none-any.whl size=34191 sha256=984d921e31ce51f65d86cd82886f87e772ca6d49b84598a02e5d0a2024e7cbd1
Stored in directory: /root/.cache/pip/wheels/e2/33/27/5b3de59cfbdc8df69a691254cf86847b733db9db78b07c6704
Building wheel for docopt (setup.py) ... [?25l[?25hdone
Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13723 sha256=f49424aa677829372a04027263cd07f7786bfc8d546402d1909ab3d533234d7c
Stored in directory: /root/.cache/pip/wheels/72/b0/3f/1d95f96ff986c7dfffe46ce2be4062f38ebd04b506c77c81b9
Successfully built crcmod dill google-apitools hdfs docopt
Installing collected packages: tensorboard-plugin-wit, pytz, libclang, flatbuffers, docopt, crcmod, zstandard, zipp, wrapt, urllib3, uritemplate, typing-extensions, termcolor, tensorflow-io-gcs-filesystem, tensorflow-estimator, tensorboard-data-server, sqlparse, regex, python-dateutil, pyparsing, pymongo, pyasn1, protobuf, packaging, overrides, orjson, objsize, oauthlib, numpy, MarkupSafe, keras, idna, grpcio, google-pasta, google-crc32c, gast, fasteners, fastavro, dill, cloudpickle, charset-normalizer, certifi, cachetools, astunparse, absl-py, werkzeug, rsa, requests, pydot, pyasn1-modules, pyarrow, proto-plus, pandas, opt-einsum, importlib-metadata, httplib2, h5py, googleapis-common-protos, google-resumable-media, tensorflow-metadata, requests-oauthlib, oauth2client, markdown, hdfs, grpcio-status, google-auth, grpc-google-iam-v1, google-auth-oauthlib, google-auth-httplib2, google-apitools, google-api-core, apache-beam, tensorboard, google-cloud-core, google-api-python-client, tensorflow, google-cloud-vision, google-cloud-videointelligence, google-cloud-spanner, google-cloud-recommendations-ai, google-cloud-pubsub, google-cloud-language, google-cloud-dlp, google-cloud-datastore, google-cloud-bigtable, google-cloud-bigquery-storage, google-cloud-bigquery, tensorflow-serving-api, google-cloud-pubsublite, tfx-bsl, tensorflow-transform
Successfully installed MarkupSafe-2.1.3 absl-py-1.4.0 apache-beam-2.46.0 astunparse-1.6.3 cachetools-4.2.4 certifi-2023.7.22 charset-normalizer-3.2.0 cloudpickle-2.2.1 crcmod-1.7 dill-0.3.1.1 docopt-0.6.2 fastavro-1.8.0 fasteners-0.18 flatbuffers-23.5.26 gast-0.4.0 google-api-core-2.11.1 google-api-python-client-1.12.11 google-apitools-0.5.31 google-auth-2.22.0 google-auth-httplib2-0.1.0 google-auth-oauthlib-0.4.6 google-cloud-bigquery-3.11.4 google-cloud-bigquery-storage-2.16.2 google-cloud-bigtable-1.7.3 google-cloud-core-2.3.3 google-cloud-datastore-1.15.5 google-cloud-dlp-3.12.2 google-cloud-language-1.3.2 google-cloud-pubsub-2.18.3 google-cloud-pubsublite-1.7.0 google-cloud-recommendations-ai-0.7.1 google-cloud-spanner-3.40.1 google-cloud-videointelligence-1.16.3 google-cloud-vision-3.4.4 google-crc32c-1.5.0 google-pasta-0.2.0 google-resumable-media-2.5.0 googleapis-common-protos-1.60.0 grpc-google-iam-v1-0.12.6 grpcio-1.57.0 grpcio-status-1.48.2 h5py-3.8.0 hdfs-2.7.2 httplib2-0.21.0 idna-3.4 importlib-metadata-6.7.0 keras-2.11.0 libclang-16.0.6 markdown-3.4.4 numpy-1.21.6 oauth2client-4.1.3 oauthlib-3.2.2 objsize-0.6.1 opt-einsum-3.3.0 orjson-3.9.5 overrides-6.5.0 packaging-23.1 pandas-1.3.5 proto-plus-1.22.3 protobuf-3.19.6 pyarrow-6.0.1 pyasn1-0.5.0 pyasn1-modules-0.3.0 pydot-1.4.2 pymongo-3.13.0 pyparsing-3.1.1 python-dateutil-2.8.2 pytz-2023.3 regex-2023.8.8 requests-2.31.0 requests-oauthlib-1.3.1 rsa-4.9 sqlparse-0.4.4 tensorboard-2.11.2 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorflow-2.11.0 tensorflow-estimator-2.11.0 tensorflow-io-gcs-filesystem-0.33.0 tensorflow-metadata-1.12.0 tensorflow-serving-api-2.11.0 tensorflow-transform-1.12.0 termcolor-2.3.0 tfx-bsl-1.12.0 typing-extensions-4.7.1 uritemplate-3.0.1 urllib3-1.26.16 werkzeug-2.2.3 wrapt-1.15.0 zipp-3.15.0 zstandard-0.21.0
[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv[0m[33m
[0m
Collecting dill==0.3.3
Downloading dill-0.3.3-py2.py3-none-any.whl (81 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.3/81.3 KB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: dill
Attempting uninstall: dill
Found existing installation: dill 0.3.1.1
Uninstalling dill-0.3.1.1:
Successfully uninstalled dill-0.3.1.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.3 which is incompatible.[0m[31m
[0mSuccessfully installed dill-0.3.3
[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv[0m[33m
[0m
Note: Please DO NOT restart the runtime after installing the packages. Doing so might cause issues because the Python version was switched. If you restarted it by mistake, kindly go to Runtime > Disconnect and delete runtime, and then re-run the previous cells.
Next, you’ll define the working directory to contain all the results you will generate in this exercise. After each phase, you can open the Colab file explorer on the left and look under the results directory to see the new files and directories generated.
# Define working directory
WORK_DIR = "results"
With that, you are now ready to execute the pipeline. As shown in the figure earlier, the logic is already implemented in the four main scripts. You will run them one by one, and the next sections will discuss the relevant details and the outputs generated.
The first step is to extract the input data. The dataset is stored as SDF files and is extracted from the National Center for Biotechnology Information (FTP source). Chapter 6 of this document shows a more detailed description of the SDF file format.
The data-extractor.py file extracts and decompresses the specified SDF files. In later steps, the example preprocesses these files and uses the data to train and evaluate the machine learning model. The file extracts the SDF files from the public source and stores them in a subdirectory inside the specified working directory.
As you can see here, the complete set of files is huge and can easily exceed storage limits in Colab. For this exercise, you will just download one file. You can use the script as shown in the cells below:
# Print the help documentation. You can ignore references to GCP because you will be running everything in Colab.
!python ./molecules/data-extractor.py --help
python3: can't open file './molecules/data-extractor.py': [Errno 2] No such file or directory
# Run the data extractor
!python ./molecules/data-extractor.py --max-data-files 1 --work-dir={WORK_DIR}
python3: can't open file './molecules/data-extractor.py': [Errno 2] No such file or directory
You should now have a new folder in your work directory called data. This will contain the SDF file you downloaded.
# List working directory
!ls {WORK_DIR}
ls: cannot access 'results': No such file or directory
The SDF documentation linked earlier shows that one record is terminated by $$$$. You can use the command below to print the first one in the file. As you’ll see, just one record is already pretty long. In the next phase, you’ll feed these records into a pipeline that will transform them into a form that can be consumed by our model.
# Print one record
!sed '/$$$$/q' {WORK_DIR}/data/00000001_00025000.sdf
sed: can't read results/data/00000001_00025000.sdf: No such file or directory
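If you prefer to inspect the file in Python instead of with sed, here is a small, hypothetical helper (not one of the lab scripts) that splits the file into records on the $$$$ delimiter; the path assumes the file downloaded by data-extractor.py above:

# Hypothetical helper: iterate over SDF records delimited by '$$$$'
sdf_path = 'results/data/00000001_00025000.sdf'  # file created by data-extractor.py

def iter_sdf_records(path):
    record_lines = []
    with open(path) as f:
        for line in f:
            record_lines.append(line)
            if line.strip() == '$$$$':
                yield ''.join(record_lines)
                record_lines = []

# Preview the start of the first record
print(next(iter_sdf_records(sdf_path))[:500])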
The next script, preprocess.py, uses an Apache Beam pipeline to preprocess the data. The pipeline performs the following preprocessing actions: it reads and parses the molecules from the extracted SDF files, counts the atoms in each molecule, normalizes the counts with tf.Transform, splits the result into training and evaluation sets, and writes them out as TFRecords together with the tf.Transform outputs.
Apache Beam transforms can efficiently manipulate single elements at a time, but transforms that require a full pass of the dataset cannot easily be done with only Apache Beam and are better done using tf.Transform. Because of this, the code uses Apache Beam transforms to read and format the molecules, and to count the atoms in each molecule. The code then uses tf.Transform to find the global minimum and maximum counts in order to normalize the data.
The following image shows the steps in the pipeline.
You will run the script first and the following sections will discuss the relevant parts of this code. This will take around 6 minutes to run.
# Print help documentation
!python ./molecules/preprocess.py --help
python3: can't open file './molecules/preprocess.py': [Errno 2] No such file or directory
# Run the preprocessing script
!python ./molecules/preprocess.py --work-dir={WORK_DIR}
python3: can't open file './molecules/preprocess.py': [Errno 2] No such file or directory
You should now have a few more outputs in your work directory. The most important are the training and evaluation datasets, the PreprocessData file, and the tf.Transform outputs.
# List working directory
!ls {WORK_DIR}
ls: cannot access 'results': No such file or directory
The training and evaluation datasets contain TFRecords and you can view them by running the helper function in the cells below.
from google.protobuf.json_format import MessageToDict
# Define a helper function to get individual examples
def get_records(dataset, num_records):
'''Extracts records from the given dataset.
Args:
dataset (TFRecordDataset): dataset saved in the preprocessing step
num_records (int): number of records to preview
'''
# initialize an empty list
records = []
# Use the `take()` method to specify how many records to get
for tfrecord in dataset.take(num_records):
# Get the numpy property of the tensor
serialized_example = tfrecord.numpy()
# Initialize a `tf.train.Example()` to read the serialized data
example = tf.train.Example()
# Read the example data (output is a protocol buffer message)
example.ParseFromString(serialized_example)
# Convert the protocol buffer message to a Python dictionary
example_dict = (MessageToDict(example))
# append to the records list
records.append(example_dict)
return records
import tensorflow as tf
from pprint import pprint
# Create TF Dataset from TFRecord of training set
train_data = tf.data.TFRecordDataset(f'{WORK_DIR}/train-dataset/part-00000-of-00001')
# Print two records
test_data = get_records(train_data, 2)
pprint(test_data)
---------------------------------------------------------------------------
NotFoundError Traceback (most recent call last)
<ipython-input-13-dda645320f70> in <cell line: 8>()
6
7 # Print two records
----> 8 test_data = get_records(train_data, 2)
9
10 pprint(test_data)
<ipython-input-12-68f3a263de80> in get_records(dataset, num_records)
13
14 # Use the `take()` method to specify how many records to get
---> 15 for tfrecord in dataset.take(num_records):
16
17 # Get the numpy property of the tensor
/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/iterator_ops.py in __next__(self)
795 def __next__(self):
796 try:
--> 797 return self._next_internal()
798 except errors.OutOfRangeError:
799 raise StopIteration
/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/iterator_ops.py in _next_internal(self)
778 # to communicate that there is no more data to iterate over.
779 with context.execution_mode(context.SYNC):
--> 780 ret = gen_dataset_ops.iterator_get_next(
781 self._iterator_resource,
782 output_types=self._flat_output_types,
/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/gen_dataset_ops.py in iterator_get_next(iterator, output_types, output_shapes, name)
3014 return _result
3015 except _core._NotOkStatusException as e:
-> 3016 _ops.raise_from_not_ok_status(e, name)
3017 except _core._FallbackException:
3018 pass
/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
7260 def raise_from_not_ok_status(e, name):
7261 e.message += (" name: " + name if name is not None else "")
-> 7262 raise core._status_to_exception(e) from None # pylint: disable=protected-access
7263
7264
NotFoundError: results/train-dataset/part-00000-of-00001; No such file or directory [Op:IteratorGetNext]
Note: From the output cell above, you might agree that we’ll need more than the atom counts to make better predictions. You’ll notice that the counts are identical in both records but the Energy value is different. Thus, you cannot expect the model to have a low loss during the training phase later. For simplicity, we’ll just use atom counts in this exercise, but feel free to revise the code later to include more predictive features. You can share your findings in our Discourse community to discuss with other learners who are interested in the same problem.
The PreprocessData file contains Python objects needed in the training phase, such as the input feature spec and the locations of the training and evaluation datasets. These are saved in a serialized file using dill when you ran the preprocess script earlier, and you can deserialize it using the cell below to view its contents.
# Import path to the Python3.7 packages so it can find the dill module
import sys
sys.path.insert(0, r"/usr/local/lib/python3.7/dist-packages")
import dill as pickle
# Helper function to load the serialized file
def load(filename):
with tf.io.gfile.GFile(filename, 'rb') as f:
return pickle.load(f)
# Load PreprocessData
preprocess_data = load('/content/results/PreprocessData')
# Print contents
pprint(vars(preprocess_data))
The next sections will describe how these are implemented as a Beam pipeline in preprocess.py. You can open this file in a separate text editor so you can look at it more closely while reviewing the snippets below.
The preprocess.py code creates an Apache Beam pipeline.
# Build and run a Beam Pipeline
with beam.Pipeline(options=beam_options) as p, \
beam_impl.Context(temp_dir=tft_temp_dir):
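For context, beam_options and tft_temp_dir come from the script's command-line arguments. A minimal, hypothetical local setup (not the script's actual code) might look like this:

import tempfile
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import tensorflow_transform.beam.impl as beam_impl

# Hypothetical local setup; preprocess.py builds these from its CLI arguments
beam_options = PipelineOptions(runner='DirectRunner')
tft_temp_dir = tempfile.mkdtemp()

with beam.Pipeline(options=beam_options) as p, \
        beam_impl.Context(temp_dir=tft_temp_dir):
    pass  # the feature extraction, tf.Transform, and write steps go here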
Next, the code applies a feature_extraction transform to the pipeline.
# Transform and validate the input data matches the input schema
dataset = (
p
| 'Feature extraction' >> feature_extraction
The pipeline uses SimpleFeatureExtraction as its feature_extraction transform.
pubchem.SimpleFeatureExtraction(pubchem.ParseSDF(data_files_pattern)),
The SimpleFeatureExtraction transform, defined in pubchem/pipeline.py, contains a series of transforms that manipulate all elements independently. First, the code parses the molecules from the source file, then formats the molecules to a dictionary of molecule properties, and finally, counts the atoms in the molecule. These counts are the features (inputs) for the machine learning model.
class SimpleFeatureExtraction(beam.PTransform):
"""The feature extraction (element-wise transformations).
We create a `PTransform` class. This `PTransform` is a bundle of
transformations that can be applied to any other pipeline as a step.
We'll extract all the raw features here. Due to the nature of `PTransform`s,
we can only do element-wise transformations here. Anything that requires a
full-pass of the data (such as feature scaling) has to be done with
tf.Transform.
"""
def __init__(self, source):
super(SimpleFeatureExtraction, self).__init__()
self.source = source
def expand(self, p):
# Return the preprocessing pipeline. In this case we're reading the PubChem
# files, but the source could be any Apache Beam source.
return (p
| 'Read raw molecules' >> self.source
| 'Format molecule' >> beam.ParDo(FormatMolecule())
| 'Count atoms' >> beam.ParDo(CountAtoms())
)
The read transform beam.io.Read(pubchem.ParseSDF(data_files_pattern)) reads SDF files from a custom source.
The custom source, called ParseSDF, is defined in pubchem/pipeline.py. ParseSDF extends FileBasedSource and implements the read_records function that opens the extracted SDF files.
The pipeline groups the raw data into sections of relevant information needed for the next steps. Each section in the parsed SDF file is stored in a dictionary (see pubchem/sdf.py), where the keys are the section names and the values are the raw line contents of the corresponding section.
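To illustrate the pattern only (this is not the actual ParseSDF code), a bare-bones custom source that extends FileBasedSource and implements read_records could look like this:

from apache_beam.io import filebasedsource

class LineSource(filebasedsource.FileBasedSource):
    """Illustrative custom source: yields raw lines from the matched files.
    ParseSDF follows the same pattern but parses SDF sections instead."""

    def __init__(self, file_pattern):
        # Keep the sketch simple by marking the source as unsplittable
        super().__init__(file_pattern, splittable=False)

    def read_records(self, file_name, offset_range_tracker):
        with self.open_file(file_name) as f:
            for line in f:
                yield line

# Usage sketch: p | 'Read raw lines' >> beam.io.Read(LineSource('results/data/*.sdf'))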
The code applies beam.ParDo(FormatMolecule()) to the pipeline. The ParDo applies the DoFn named FormatMolecule to each molecule. FormatMolecule yields a dictionary of formatted molecules. The following snippet is an example of an element in the output PCollection:
{
'atoms': [
{
'atom_atom_mapping_number': 0,
'atom_stereo_parity': 0,
'atom_symbol': u'O',
'charge': 0,
'exact_change_flag': 0,
'h0_designator': 0,
'hydrogen_count': 0,
'inversion_retention': 0,
'mass_difference': 0,
'stereo_care_box': 0,
'valence': 0,
'x': -0.0782,
'y': -1.5651,
'z': 1.3894,
},
...
],
'bonds': [
{
'bond_stereo': 0,
'bond_topology': 0,
'bond_type': 1,
'first_atom_number': 1,
'reacting_center_status': 0,
'second_atom_number': 5,
},
...
],
'<PUBCHEM_COMPOUND_CID>': ['3\n'],
...
'<PUBCHEM_MMFF94_ENERGY>': ['19.4085\n'],
...
}
Then, the code applies beam.ParDo(CountAtoms()) to the pipeline. The DoFn CountAtoms sums the number of carbon, hydrogen, nitrogen, and oxygen atoms each molecule has. CountAtoms outputs a PCollection of features and labels. Here is an example of an element in the output PCollection:
{
'ID': 3,
'TotalC': 7,
'TotalH': 8,
'TotalO': 4,
'TotalN': 0,
'Energy': 19.4085,
}
The pipeline then validates the inputs. The ValidateInputData DoFn validates that every element matches the metadata given in the input_schema. This validation ensures that the data is in the correct format when it’s fed into TensorFlow.
| 'Validate inputs' >> beam.ParDo(ValidateInputData(
input_feature_spec)))
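As a rough idea of what such a check involves (this is not the lab's exact implementation), a validation DoFn could simply compare each element's keys against the feature spec:

import apache_beam as beam

class ValidateInputDataSketch(beam.DoFn):
    """Illustrative DoFn: fails fast if an element does not have the feature spec keys."""

    def __init__(self, input_feature_spec):
        super().__init__()
        self.expected_keys = set(input_feature_spec.keys())

    def process(self, elem):
        if set(elem.keys()) != self.expected_keys:
            raise ValueError(
                'Element keys {} do not match the feature spec keys {}'.format(
                    sorted(elem.keys()), sorted(self.expected_keys)))
        yield elem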
The Molecules code sample uses a Deep Neural Network Regressor to make predictions. The general recommendation is to normalize the inputs before feeding them into the ML model. The pipeline uses tf.Transform to normalize the counts of each atom to values between 0 and 1. To read more about normalizing inputs, see feature scaling.
Normalizing the values requires a full pass through the dataset, recording the minimum and maximum values. The code uses tf.Transform to go through the entire dataset and apply full-pass transforms.
To use tf.Transform, the code must provide a function that contains the logic of the transform to perform on the dataset. In preprocess.py, the code uses the AnalyzeAndTransformDataset transform provided by tf.Transform. Learn more about how to use tf.Transform.
# Apply the tf.Transform preprocessing_fn
input_metadata = dataset_metadata.DatasetMetadata(
dataset_schema.from_feature_spec(input_feature_spec))
dataset_and_metadata, transform_fn = (
(dataset, input_metadata)
| 'Feature scaling' >> beam_impl.AnalyzeAndTransformDataset(
feature_scaling))
dataset, metadata = dataset_and_metadata
In preprocess.py, the feature_scaling function used is normalize_inputs, which is defined in pubchem/pipeline.py. The function uses the tf.Transform function scale_to_0_1 to normalize the counts to values between 0 and 1.
def normalize_inputs(inputs):
"""Preprocessing function for tf.Transform (full-pass transformations).
Here we will do any preprocessing that requires a full-pass of the dataset.
It takes as inputs the preprocessed data from the `PTransform` we specify, in
this case `SimpleFeatureExtraction`.
Common operations might be scaling values to 0-1, getting the minimum or
maximum value of a certain field, creating a vocabulary for a string field.
There are two main types of transformations supported by tf.Transform, for
more information, check the following modules:
- analyzers: tensorflow_transform.analyzers.py
- mappers: tensorflow_transform.mappers.py
Any transformation done in tf.Transform will be embedded into the TensorFlow
model itself.
"""
return {
# Scale the input features for normalization
'NormalizedC': tft.scale_to_0_1(inputs['TotalC']),
'NormalizedH': tft.scale_to_0_1(inputs['TotalH']),
'NormalizedO': tft.scale_to_0_1(inputs['TotalO']),
'NormalizedN': tft.scale_to_0_1(inputs['TotalN']),
# Do not scale the label since we want the absolute number for prediction
'Energy': inputs['Energy'],
}
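Conceptually, scale_to_0_1 applies min-max scaling, (x - min) / (max - min), where the minimum and maximum are computed over the entire dataset during the analyze phase. A quick plain-Python illustration (not tf.Transform itself) with hypothetical carbon counts:

# Plain-Python illustration of min-max scaling; tf.Transform computes the min and
# max with a full pass over the dataset before applying the same mapping.
total_c = [7, 3, 12, 5]  # hypothetical carbon counts
lo, hi = min(total_c), max(total_c)
print([(x - lo) / (hi - lo) for x in total_c])  # [0.444..., 0.0, 1.0, 0.222...]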
Next, the preprocess.py pipeline partitions the single dataset into two datasets. It allocates approximately 80% of the data to be used as training data, and approximately 20% of the data to be used as evaluation data.
# Split the dataset into a training set and an evaluation set
assert 0 < eval_percent < 100, 'eval_percent must in the range (0-100)'
train_dataset, eval_dataset = (
dataset
| 'Split dataset' >> beam.Partition(
lambda elem, _: int(random.uniform(0, 100) < eval_percent), 2))
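To see how the partition function behaves, note that the lambda returns 1 with probability eval_percent/100 (sending that element to the evaluation partition) and 0 otherwise. Here is a small standalone sketch with a hypothetical 20% evaluation split:

import random
import apache_beam as beam

eval_percent = 20.0  # hypothetical split matching the ~80/20 described above

with beam.Pipeline() as p:
    train, evaluation = (
        p
        | 'Create toy elements' >> beam.Create(list(range(20)))
        | 'Split dataset' >> beam.Partition(
            lambda elem, _: int(random.uniform(0, 100) < eval_percent), 2))
    _ = train | 'Print train' >> beam.Map(lambda x: print('train', x))
    _ = evaluation | 'Print eval' >> beam.Map(lambda x: print('eval', x))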
Finally, the preprocess.py pipeline writes the two datasets (training and evaluation) using the WriteToTFRecord transform.
# Write the datasets as TFRecords
coder = example_proto_coder.ExampleProtoCoder(metadata.schema)
train_dataset_prefix = os.path.join(train_dataset_dir, 'part')
_ = (
train_dataset
| 'Write train dataset' >> tfrecordio.WriteToTFRecord(
train_dataset_prefix, coder))
eval_dataset_prefix = os.path.join(eval_dataset_dir, 'part')
_ = (
eval_dataset
| 'Write eval dataset' >> tfrecordio.WriteToTFRecord(
eval_dataset_prefix, coder))
# Write the transform_fn
_ = (
transform_fn
| 'Write transformFn' >> transform_fn_io.WriteTransformFn(work_dir))
Recall that at the end of the preprocessing phase, the code split the data into two datasets (training and evaluation).
The script uses a simple dense neural network for the regression problem. The trainer/task.py file contains the code for training the model. The main function of trainer/task.py loads the parameters needed from the preprocessing phase and passes them to the task runner function (i.e. run_fn).
In this exercise, we will not focus too much on the training metrics (e.g. accuracy). That is discussed in other courses of this specialization. The main objective is to look at the outputs and how they are connected to the prediction phase.
# Print help documentation
!python ./molecules/trainer/task.py --help
# Run the trainer.
!python ./molecules/trainer/task.py --work-dir {WORK_DIR}
The outputs of this phase are in the model directory. This will be the trained model that you will use for predictions.
!ls {WORK_DIR}/model
The important thing to note in the training script is that it also exports the transformation graph with the model. That is shown in these lines:
# Define default serving signature
signatures = {
'serving_default':
_get_serve_tf_examples_fn(model,
tf_transform_output, input_feature_spec).get_concrete_function(
[signatures_dict])
}
# Save model with signature
model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures, include_optimizer=False)
The implementation of _get_serve_tf_examples_fn() is as follows:
def _get_serve_tf_examples_fn(model, tf_transform_output, feature_spec):
"""Returns a function that applies data transformation and generates predictions"""
# Get transformation graph
model.tft_layer = tf_transform_output.transform_features_layer()
@tf.function
def serve_tf_examples_fn(inputs_list):
"""Returns the output to be used in the serving signature."""
# Create a shallow copy of the dictionary in the single element list
inputs = inputs_list[0].copy()
# Pop ID since it is not needed in the transformation graph
# Also needed to identify predictions
id_key = inputs.pop('ID')
# Apply data transformation to the raw inputs
transformed = model.tft_layer(inputs)
# Pass the transformed data to the model to get predictions
predictions = model(transformed.values())
return id_key, predictions
return serve_tf_examples_fn
The use of model.tft_layer means that your model can accept raw data and will apply the transformation before feeding it to the model to make predictions. This implies that when you serve your model for predictions, you don’t have to worry about creating a pipeline to transform new data coming in. The model will already do that for you through this serving input function. It helps prevent training-serving skew since you’re handling the training and serving data the same way.
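If you want to check what the exported serving signature expects, you can load the saved model and inspect it. This is just a quick sketch, assuming the results/model path produced by the training step in this lab:

import tensorflow as tf

# Assumed export path from the training step above
loaded = tf.saved_model.load('results/model')
serving_fn = loaded.signatures['serving_default']
print(serving_fn.structured_input_signature)  # inputs the signature accepts
print(serving_fn.structured_outputs)          # outputs it returns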
After training the model, you can provide the model with inputs and it will make predictions. The pipeline in predict.py is responsible for making predictions. It reads the input files from the custom source and writes the output predictions as text files to the specified working directory.
if args.verb == 'batch':
data_files_pattern = os.path.join(args.inputs_dir, '*.sdf')
results_prefix = os.path.join(args.outputs_dir, 'part')
source = pubchem.ParseSDF(data_files_pattern)
sink = beam.io.WriteToText(results_prefix)
The following image shows the steps in the prediction pipeline:
In predict.py, the code defines the pipeline in the run function:
def run(model_dir, feature_extraction, sink, beam_options=None):
print('Listening...')
with beam.Pipeline(options=beam_options) as p:
_ = (p
| 'Feature extraction' >> feature_extraction
| 'Predict' >> beam.ParDo(Predict(model_dir, 'ID'))
| 'Format as JSON' >> beam.Map(json.dumps)
| 'Write predictions' >> sink)
The code calls the run function with the following parameters:
run(
args.model_dir,
pubchem.SimpleFeatureExtraction(source),
sink,
beam_options)
First, the code passes the pubchem.SimpleFeatureExtraction(source) transform as the feature_extraction transform. This transform, which was also used in the preprocessing phase, is applied to the pipeline:
class SimpleFeatureExtraction(beam.PTransform):
"""The feature extraction (element-wise transformations).
We create a `PTransform` class. This `PTransform` is a bundle of
transformations that can be applied to any other pipeline as a step.
We'll extract all the raw features here. Due to the nature of `PTransform`s,
we can only do element-wise transformations here. Anything that requires a
full-pass of the data (such as feature scaling) has to be done with
tf.Transform.
"""
def __init__(self, source):
super(SimpleFeatureExtraction, self).__init__()
self.source = source
def expand(self, p):
# Return the preprocessing pipeline. In this case we're reading the PubChem
# files, but the source could be any Apache Beam source.
return (p
| 'Read raw molecules' >> self.source
| 'Format molecule' >> beam.ParDo(FormatMolecule())
| 'Count atoms' >> beam.ParDo(CountAtoms())
)
The transform reads from the appropriate source based on the pipeline’s execution mode (i.e. batch), formats the molecules, and counts the different atoms in each molecule.
Next, beam.ParDo(Predict(…)) is applied to the pipeline to perform the prediction of the molecular energy. Predict, the DoFn that’s passed, uses the given dictionary of input features (atom counts) to predict the molecular energy.
The next transform applied to the pipeline is beam.Map(json.dumps), which takes the prediction result dictionary and serializes it into a JSON string. Finally, the output is written to the sink.
Batch predictions are optimized for throughput rather than latency. They work best if you’re making many predictions and you can wait for all of them to finish before getting the results. You can run the following cells to use the script for batch predictions. For simplicity, you will use the same file you used for training. If you want, however, you can use the data extractor script from earlier to grab a different SDF file and feed it here.
# Print help documentation. You can ignore references to GCP and streaming data.
!python ./molecules/predict.py --help
# Define model, input and output data directories
MODEL_DIR = f'{WORK_DIR}/model'
DATA_DIR = f'{WORK_DIR}/data'
PRED_DIR = f'{WORK_DIR}/predictions'
# Run batch prediction. This can take up to 7 minutes.
!python ./molecules/predict.py \
--model-dir {MODEL_DIR} \
--work-dir {WORK_DIR} \
batch \
--inputs-dir {DATA_DIR} \
--outputs-dir {PRED_DIR}
The results should now be in the predictions folder. This is just a text file, so you can easily print the output.
# List working directory
!ls {WORK_DIR}
# Print the first 100 results
!head -n 100 /content/results/predictions/part-00000-of-00001
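Since each line is produced by json.dumps, you can also load the results back into Python. Here is a small sketch; the exact structure of each result depends on what the Predict DoFn yields:

import itertools
import json

# Assumed output path based on the PRED_DIR used above
pred_file = 'results/predictions/part-00000-of-00001'

with open(pred_file) as f:
    for line in itertools.islice(f, 5):
        print(json.loads(line))  # each line is one JSON-serialized prediction result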
You’ve now completed all phases of the Beam-based pipeline! Similar processes are done under the hood by higher-level frameworks such as TFX, and you can use the techniques here to understand their codebases better or to extend them for your own needs. As mentioned earlier, the original article also offers the option to use GCP and to perform online predictions. Feel free to try it out, but be aware of the recurring costs.
On to the next part of the course!