Coursera

Week 2 Assignment: Feature Engineering

For this week’s assignment, you will build a data pipeline using using Tensorflow Extended (TFX) to prepare features from the Metro Interstate Traffic Volume dataset. Try to only use the documentation and code hints to accomplish the tasks but feel free to review the 2nd ungraded lab this week in case you get stuck.

Upon completion, you will have:

Let’s begin!

Table of Contents

# IMPORTANT: This will check your notebook's metadata for grading.
# Please do not continue the lab unless the output of this cell tells you to proceed. 
!python add_metadata.py --filename C2W2_Assignment.ipynb
Grader metadata detected! You can proceed with the lab!

NOTE: To prevent errors from the autograder, you are not allowed to edit or delete non-graded cells in this notebook . Please only put your solutions in between the ### START CODE HERE and ### END CODE HERE code comments, and also refrain from adding any new cells. Once you have passed this assignment and want to experiment with any of the non-graded code, you may follow the instructions at the bottom of this notebook.

1 - Setup

As usual, you will first need to import the necessary packages. For reference, the lab environment uses TensorFlow version: 2.6 and TFX version: 1.3.

1.1 Imports

# grader-required-cell

import os

import tensorflow as tf

from tfx import v1 as tfx
import tensorflow_transform.beam as tft_beam
from google.protobuf.json_format import MessageToDict
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

import tempfile
import pprint
import warnings

pp = pprint.PrettyPrinter()

# ignore tf warning messages
tf.get_logger().setLevel('ERROR')
warnings.filterwarnings("ignore")

1.2 - Define paths

You will define a few global variables to indicate paths in the local workspace.

# grader-required-cell

# location of the pipeline metadata store
_pipeline_root = './pipeline'

# directory of the raw data files
_data_root = './data'

# path to the raw training data
_data_filepath = os.path.join(_data_root, 'metro_traffic_volume.csv')

1.3 - Preview the dataset

The Metro Interstate Traffic Volume dataset contains hourly traffic volume of a road in Minnesota from 2012-2018. With this data, you can develop a model for predicting the traffic volume given the date, time, and weather conditions. The attributes are:

Disclaimer: We added the last four attributes shown above (i.e. month, day, day_of_week, hour) to the original dataset to increase the features you can transform later.

Take a quick look at the first few rows of the CSV file.

# grader-required-cell

# Preview the dataset
!head {_data_filepath}
holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume,month,day,day_of_week,hour
None,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545,10,2,1,9
None,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516,10,2,1,10
None,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767,10,2,1,11
None,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026,10,2,1,12
None,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918,10,2,1,13
None,291.72,0.0,0.0,1,Clear,sky is clear,2012-10-02 14:00:00,5181,10,2,1,14
None,293.17,0.0,0.0,1,Clear,sky is clear,2012-10-02 15:00:00,5584,10,2,1,15
None,293.86,0.0,0.0,1,Clear,sky is clear,2012-10-02 16:00:00,6015,10,2,1,16
None,294.14,0.0,0.0,20,Clouds,few clouds,2012-10-02 17:00:00,5791,10,2,1,17

1.4 - Create the InteractiveContext

You will need to initialize the InteractiveContext to enable running the TFX components interactively. As before, you will let it create the metadata store in the _pipeline_root directory. You can safely ignore the warning about the missing metadata config file.

# grader-required-cell

# Declare the InteractiveContext and use a local sqlite file as the metadata store.
# You can ignore the warning about the missing metadata config file
context = InteractiveContext(pipeline_root=_pipeline_root)
WARNING:absl:InteractiveContext metadata_connection_config not provided: using SQLite ML Metadata database at ./pipeline/metadata.sqlite.

2 - Run TFX components interactively

In the following exercises, you will create the data pipeline components one-by-one, run each of them, and visualize their output artifacts. Recall that we refer to the outputs of pipeline components as artifacts and these can be inputs to the next stage of the pipeline.

2.1 - ExampleGen

The pipeline starts with the ExampleGen component. It will:

Exercise 1: ExampleGen

Fill out the code below to ingest the data from the CSV file stored in the _data_root directory.

# grader-required-cell

### START CODE HERE

# Instantiate ExampleGen with the input CSV dataset
example_gen = tfx.components.CsvExampleGen(input_base=_data_root)

# Run the component using the InteractiveContext instance
context.run(example_gen)

### END CODE HERE
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.
<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
ExecutionResult at 0x7f0760172580
.execution_id6
.component<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
.component.inputs{}
.component.outputs
['examples']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }

You should see the output cell of the InteractiveContext above showing the metadata associated with the component execution. You can expand the items under .component.outputs and see that an Examples artifact for the train and eval split is created in metro_traffic_pipeline/CsvExampleGen/examples/{execution_id}.

You can also check that programmatically with the following snippet. You can focus on the try block. The except and else block is needed mainly for grading. context.run() yields no operation when executed in a non-interactive environment (such as the grader script that runs outside of this notebook). In such scenarios, the URI must be manually set to avoid errors.

# grader-required-cell

try:
    # get the artifact object
    artifact = example_gen.outputs['examples'].get()[0]
    
    # print split names and uri
    print(f'split names: {artifact.split_names}')
    print(f'artifact uri: {artifact.uri}')

# for grading since context.run() does not work outside the notebook
except IndexError:
    print("context.run() was no-op")
    examples_path = './pipeline/CsvExampleGen/examples'
    dir_id = os.listdir(examples_path)[0]
    artifact_uri = f'{examples_path}/{dir_id}'

else:
    artifact_uri = artifact.uri
split names: ["train", "eval"]
artifact uri: ./pipeline/CsvExampleGen/examples/6

The ingested data has been saved to the directory specified by the artifact Uniform Resource Identifier (URI). As a sanity check, you can take a look at some of the training examples. This requires working with Tensorflow data types, particularly tf.train.Example and TFRecord (you can read more about them here). Let’s first load the TFRecord into a variable:

# grader-required-cell

# Get the URI of the output artifact representing the training examples, which is a directory
train_uri = os.path.join(artifact_uri, 'Split-train')

# Get the list of files in this directory (all compressed TFRecord files)
tfrecord_filenames = [os.path.join(train_uri, name)
                      for name in os.listdir(train_uri)]

# Create a `TFRecordDataset` to read these files
dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type="GZIP")

Exercise 2: get_records()

Complete the helper function below to return a specified number of examples.

Hints: You may find the MessageToDict helper function and tf.train.Example’s ParseFromString() method useful here. You can also refer here for a refresher on TFRecord and tf.train.Example()

# grader-required-cell

def get_records(dataset, num_records):
    '''Extracts records from the given dataset.
    Args:
        dataset (TFRecordDataset): dataset saved by ExampleGen
        num_records (int): number of records to preview
    '''
    
    # initialize an empty list
    records = []

    ### START CODE HERE
    # Use the `take()` method to specify how many records to get
    for tfrecord in dataset.take(num_records):
        
        # Get the numpy property of the tensor
        serialized_example = tfrecord.numpy()
        
        # Initialize a `tf.train.Example()` to read the serialized data
        example = tf.train.Example()
        
        # Read the example data (output is a protocol buffer message)
        example.ParseFromString(serialized_example)
        
        # convert the protocol bufffer message to a Python dictionary
        example_dict = (MessageToDict(example))
        
        # append to the records list
        records.append(example_dict)
        
    ### END CODE HERE
    return records
# grader-required-cell

# Get 3 records from the dataset
sample_records = get_records(dataset, 3)

# Print the output
pp.pprint(sample_records)
[{'features': {'feature': {'clouds_all': {'int64List': {'value': ['40']}},
                           'date_time': {'bytesList': {'value': ['MjAxMi0xMC0wMiAwOTowMDowMA==']}},
                           'day': {'int64List': {'value': ['2']}},
                           'day_of_week': {'int64List': {'value': ['1']}},
                           'holiday': {'bytesList': {'value': ['Tm9uZQ==']}},
                           'hour': {'int64List': {'value': ['9']}},
                           'month': {'int64List': {'value': ['10']}},
                           'rain_1h': {'floatList': {'value': [0.0]}},
                           'snow_1h': {'floatList': {'value': [0.0]}},
                           'temp': {'floatList': {'value': [288.28]}},
                           'traffic_volume': {'int64List': {'value': ['5545']}},
                           'weather_description': {'bytesList': {'value': ['c2NhdHRlcmVkIGNsb3Vkcw==']}},
                           'weather_main': {'bytesList': {'value': ['Q2xvdWRz']}}}}},
 {'features': {'feature': {'clouds_all': {'int64List': {'value': ['75']}},
                           'date_time': {'bytesList': {'value': ['MjAxMi0xMC0wMiAxMDowMDowMA==']}},
                           'day': {'int64List': {'value': ['2']}},
                           'day_of_week': {'int64List': {'value': ['1']}},
                           'holiday': {'bytesList': {'value': ['Tm9uZQ==']}},
                           'hour': {'int64List': {'value': ['10']}},
                           'month': {'int64List': {'value': ['10']}},
                           'rain_1h': {'floatList': {'value': [0.0]}},
                           'snow_1h': {'floatList': {'value': [0.0]}},
                           'temp': {'floatList': {'value': [289.36]}},
                           'traffic_volume': {'int64List': {'value': ['4516']}},
                           'weather_description': {'bytesList': {'value': ['YnJva2VuIGNsb3Vkcw==']}},
                           'weather_main': {'bytesList': {'value': ['Q2xvdWRz']}}}}},
 {'features': {'feature': {'clouds_all': {'int64List': {'value': ['90']}},
                           'date_time': {'bytesList': {'value': ['MjAxMi0xMC0wMiAxMTowMDowMA==']}},
                           'day': {'int64List': {'value': ['2']}},
                           'day_of_week': {'int64List': {'value': ['1']}},
                           'holiday': {'bytesList': {'value': ['Tm9uZQ==']}},
                           'hour': {'int64List': {'value': ['11']}},
                           'month': {'int64List': {'value': ['10']}},
                           'rain_1h': {'floatList': {'value': [0.0]}},
                           'snow_1h': {'floatList': {'value': [0.0]}},
                           'temp': {'floatList': {'value': [289.58]}},
                           'traffic_volume': {'int64List': {'value': ['4767']}},
                           'weather_description': {'bytesList': {'value': ['b3ZlcmNhc3QgY2xvdWRz']}},
                           'weather_main': {'bytesList': {'value': ['Q2xvdWRz']}}}}}]

You should see three of the examples printed above. Now that ExampleGen has finished ingesting the data, the next step is data analysis.

2.2 - StatisticsGen

The StatisticsGen component computes statistics over your dataset for data analysis, as well as for use in downstream components. It uses the TensorFlow Data Validation library.

StatisticsGen takes as input the dataset ingested using CsvExampleGen.

Exercise 3: StatisticsGen

Fill the code below to generate statistics from the output examples of CsvExampleGen.

# grader-required-cell

### START CODE HERE
# Instantiate StatisticsGen with the ExampleGen ingested dataset
statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs["examples"])
    

# Run the component
context.run(statistics_gen)
### END CODE HERE
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.
<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
ExecutionResult at 0x7f07510a9a60
.execution_id7
.component<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
.component.inputs
['examples']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
.component.outputs
['statistics']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
# grader-required-cell

# Plot the statistics generated
context.show(statistics_gen.outputs['statistics'])

Artifact at ./pipeline/StatisticsGen/statistics/7

'train' split:

<iframe id='facets-iframe' width="100%" height="500px"> <script> facets_iframe = document.getElementById('facets-iframe'); facets_html = '<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"><\/script>'; facets_iframe.srcdoc = facets_html; facets_iframe.id = ""; setTimeout(() => { facets_iframe.setAttribute('height', facets_iframe.contentWindow.document.body.offsetHeight + 'px') }, 1500)
'eval' split:

<iframe id='facets-iframe' width="100%" height="500px"> <script> facets_iframe = document.getElementById('facets-iframe'); facets_html = '<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"><\/script>'; facets_iframe.srcdoc = facets_html; facets_iframe.id = ""; setTimeout(() => { facets_iframe.setAttribute('height', facets_iframe.contentWindow.document.body.offsetHeight + 'px') }, 1500)

2.3 - SchemaGen

The SchemaGen component also uses TFDV to generate a schema based on your data statistics. As you’ve learned previously, a schema defines the expected bounds, types, and properties of the features in your dataset.

SchemaGen will take as input the statistics that we generated with StatisticsGen, looking at the training split by default.

Exercise 4: SchemaGen

# grader-required-cell

### START CODE HERE
# Instantiate SchemaGen with the output statistics from the StatisticsGen
schema_gen = tfx.components.SchemaGen(statistics=statistics_gen.outputs["statistics"])
    
    

# Run the component
context.run(schema_gen)
### END CODE HERE
<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
ExecutionResult at 0x7f0750d07760
.execution_id8
.component<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
.component.inputs
['statistics']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
.component.outputs
['schema']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }

If all went well, you can now visualize the generated schema as a table.

# grader-required-cell

# Visualize the output
context.show(schema_gen.outputs['schema'])

Artifact at ./pipeline/SchemaGen/schema/8

<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
Type Presence Valency Domain
Feature name
'date_time' BYTES required -
'holiday' STRING required 'holiday'
'weather_description' STRING required 'weather_description'
'weather_main' STRING required 'weather_main'
'clouds_all' INT required -
'day' INT required -
'day_of_week' INT required -
'hour' INT required -
'month' INT required -
'rain_1h' FLOAT required -
'snow_1h' FLOAT required -
'temp' FLOAT required -
'traffic_volume' INT required -
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
Values
Domain
'holiday' 'Christmas Day', 'Columbus Day', 'Independence Day', 'Labor Day', 'Martin Luther King Jr Day', 'Memorial Day', 'New Years Day', 'None', 'State Fair', 'Thanksgiving Day', 'Veterans Day', 'Washingtons Birthday'
'weather_description' 'SQUALLS', 'Sky is Clear', 'broken clouds', 'drizzle', 'few clouds', 'fog', 'freezing rain', 'haze', 'heavy intensity drizzle', 'heavy intensity rain', 'heavy snow', 'light intensity drizzle', 'light intensity shower rain', 'light rain', 'light rain and snow', 'light shower snow', 'light snow', 'mist', 'moderate rain', 'overcast clouds', 'proximity shower rain', 'proximity thunderstorm', 'proximity thunderstorm with drizzle', 'proximity thunderstorm with rain', 'scattered clouds', 'shower drizzle', 'sky is clear', 'sleet', 'smoke', 'snow', 'thunderstorm', 'thunderstorm with heavy rain', 'thunderstorm with light drizzle', 'thunderstorm with light rain', 'thunderstorm with rain', 'very heavy rain', 'shower snow', 'thunderstorm with drizzle'
'weather_main' 'Clear', 'Clouds', 'Drizzle', 'Fog', 'Haze', 'Mist', 'Rain', 'Smoke', 'Snow', 'Squall', 'Thunderstorm'

Each attribute in your dataset shows up as a row in the schema table, alongside its properties. The schema also captures all the values that a categorical feature takes on, denoted as its domain.

This schema will be used to detect anomalies in the next step.

2.4 - ExampleValidator

The ExampleValidator component detects anomalies in your data based on the generated schema from the previous step. Like the previous two components, it also uses TFDV under the hood.

ExampleValidator will take as input the statistics from StatisticsGen and the schema from SchemaGen. By default, it compares the statistics from the evaluation split to the schema from the training split.

Exercise 5: ExampleValidator

Fill the code below to detect anomalies in your datasets.

# grader-required-cell

### START CODE HERE
# Instantiate ExampleValidator with the statistics and schema from the previous steps
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs["statistics"],
    schema=schema_gen.outputs["schema"]
)
    
    

# Run the component
context.run(example_validator)
### END CODE HERE
<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
ExecutionResult at 0x7f0750aeba90
.execution_id9
.component<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
.component.inputs
['statistics']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
['schema']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
.component.outputs
['anomalies']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }

As with the previous steps, you can visualize the anomalies as a table.

# grader-required-cell

# Visualize the output
context.show(example_validator.outputs['anomalies'])

Artifact at ./pipeline/ExampleValidator/anomalies/9

'train' split:

No anomalies found.

'eval' split:

No anomalies found.

If there are anomalies detected, you should examine how you should handle it. For example, you can relax distribution constraints or modify the domain of some features. You’ve had some practice with this last week when you used TFDV and you can also do that here.

For this particular case, there should be no anomalies detected and we can proceed to the next step.

2.5 - Transform

In this section, you will use the Transform component to perform feature engineering.

Transform will take as input the data from ExampleGen, the schema from SchemaGen, as well as a module containing the preprocessing function.

The component expects an external module for your Transform code so you need to use the magic command %% writefile to save the file to disk. We have defined a few constants that group the data’s attributes according to the transforms you will perform later. This file will also be saved locally.

# grader-required-cell

# Set the constants module filename
_traffic_constants_module_file = 'traffic_constants.py'
%%writefile {_traffic_constants_module_file}

# Features to be scaled to the z-score
DENSE_FLOAT_FEATURE_KEYS = ['temp', 'snow_1h']

# Features to bucketize
BUCKET_FEATURE_KEYS = ['rain_1h']

# Number of buckets used by tf.transform for encoding each feature.
FEATURE_BUCKET_COUNT = {'rain_1h': 3}

# Feature to scale from 0 to 1
RANGE_FEATURE_KEYS = ['clouds_all']

# Number of vocabulary terms used for encoding VOCAB_FEATURES by tf.transform
VOCAB_SIZE = 1000

# Count of out-of-vocab buckets in which unrecognized VOCAB_FEATURES are hashed.
OOV_SIZE = 10

# Features with string data types that will be converted to indices
VOCAB_FEATURE_KEYS = [
    'holiday',
    'weather_main',
    'weather_description'
]

# Features with int data type that will be kept as is
CATEGORICAL_FEATURE_KEYS = [
    'hour', 'day', 'day_of_week', 'month'
]

# Feature to predict
VOLUME_KEY = 'traffic_volume'

def transformed_name(key):
    return key + '_xf'
Overwriting traffic_constants.py

Exercise 6

Next, you will fill out the transform module. As mentioned, this will also be saved to disk. Specifically, you will complete the preprocessing_fn which defines the transformations. See the code comments for instructions and refer to the tft module documentation to look up which function to use for a given group of keys.

For the label (i.e. VOLUME_KEY), you will transform it to indicate if it is greater than the mean of the entire dataset.

# grader-required-cell

# Set the transform module filename
_traffic_transform_module_file = 'traffic_transform.py'
%%writefile {_traffic_transform_module_file}

import tensorflow as tf
import tensorflow_transform as tft

import traffic_constants

# Unpack the contents of the constants module
_DENSE_FLOAT_FEATURE_KEYS = traffic_constants.DENSE_FLOAT_FEATURE_KEYS
_RANGE_FEATURE_KEYS = traffic_constants.RANGE_FEATURE_KEYS
_VOCAB_FEATURE_KEYS = traffic_constants.VOCAB_FEATURE_KEYS
_VOCAB_SIZE = traffic_constants.VOCAB_SIZE
_OOV_SIZE = traffic_constants.OOV_SIZE
_CATEGORICAL_FEATURE_KEYS = traffic_constants.CATEGORICAL_FEATURE_KEYS
_BUCKET_FEATURE_KEYS = traffic_constants.BUCKET_FEATURE_KEYS
_FEATURE_BUCKET_COUNT = traffic_constants.FEATURE_BUCKET_COUNT
_VOLUME_KEY = traffic_constants.VOLUME_KEY
_transformed_name = traffic_constants.transformed_name


def preprocessing_fn(inputs):
    """tf.transform's callback function for preprocessing inputs.
    Args:
    inputs: map from feature keys to raw not-yet-transformed features.
    Returns:
    Map from string feature key to transformed feature operations.
    """
    outputs = {}

    ### START CODE HERE
    
    # Scale these features to the z-score.
    for key in _DENSE_FLOAT_FEATURE_KEYS:
        # Scale these features to the z-score.
        outputs[_transformed_name(key)] = tft.scale_to_z_score(inputs[key])
            

    # Scale these feature/s from 0 to 1
    for key in _RANGE_FEATURE_KEYS:
        outputs[_transformed_name(key)] = tft.scale_to_0_1(inputs[key])
            

    # Transform the strings into indices 
    # hint: use the VOCAB_SIZE and OOV_SIZE to define the top_k and num_oov parameters
    for key in _VOCAB_FEATURE_KEYS:
        outputs[_transformed_name(key)] = tft.compute_and_apply_vocabulary(inputs[key], top_k=_VOCAB_SIZE, num_oov_buckets=_OOV_SIZE)
            
            
            

    # Bucketize the feature
    for key in _BUCKET_FEATURE_KEYS:
        outputs[_transformed_name(key)] = tft.bucketize(inputs[key], _FEATURE_BUCKET_COUNT[key])
            

    # Keep the features as is. No tft function needed.
    for key in _CATEGORICAL_FEATURE_KEYS:
        outputs[_transformed_name(key)] = inputs[key]

        
    # Use `tf.cast` to cast the label key to float32
    traffic_volume = tf.cast(inputs[_VOLUME_KEY], tf.float32)
  
    
    # Create a feature that shows if the traffic volume is greater than the mean and cast to an int
    outputs[_transformed_name(_VOLUME_KEY)] = tf.cast(          
        # Use `tf.greater` to check if the traffic volume in a row is greater than the mean of the entire traffic volumn column
        # Hint: Use a `tft` function to compute the mean.
        tf.greater(traffic_volume, tft.mean(tf.cast(inputs[_VOLUME_KEY], tf.float32))),
        
        tf.int64)                                        
    ### END CODE HERE
    return outputs
Overwriting traffic_transform.py
# Test your preprocessing_fn

import traffic_transform
from testing_values import feature_description, raw_data

# NOTE: These next two lines are for reloading your traffic_transform module in case you need to 
# update your initial solution and re-run this cell. Please do not remove them especially if you
# have revised your solution. Else, your changes will not be detected.
import importlib
importlib.reload(traffic_transform)

raw_data_metadata = dataset_metadata.DatasetMetadata(schema_utils.schema_from_feature_spec(feature_description))

with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, _ = (
        (raw_data, raw_data_metadata) | tft_beam.AnalyzeAndTransformDataset(traffic_transform.preprocessing_fn))

transformed_data, transformed_metadata = transformed_dataset
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['/opt/conda/lib/python3.8/site-packages/ipykernel_launcher.py', '-f', '/home/jovyan/.local/share/jupyter/runtime/kernel-f4da0adf-41cf-4e39-8ee6-4f046fe93d16.json']
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.
# Test that the transformed data matches the expected output
transformed_data
[{'clouds_all_xf': 1.0,
  'day_of_week_xf': 4,
  'day_xf': 8,
  'holiday_xf': 0,
  'hour_xf': 15,
  'month_xf': 1,
  'rain_1h_xf': 2,
  'snow_1h_xf': 0.0,
  'temp_xf': 0.0,
  'traffic_volume_xf': 0,
  'weather_description_xf': 0,
  'weather_main_xf': 0}]

Expected Output:

[{'clouds_all_xf': 1.0,
  'day_of_week_xf': 4,
  'day_xf': 8,
  'holiday_xf': 0,
  'hour_xf': 15,
  'month_xf': 1,
  'rain_1h_xf': 2,
  'snow_1h_xf': 0.0,
  'temp_xf': 0.0,
  'traffic_volume_xf': 0,
  'weather_description_xf': 0,
  'weather_main_xf': 0}]
# Test that the transformed metadata's schema matches the expected output
MessageToDict(transformed_metadata.schema)
{'feature': [{'name': 'clouds_all_xf',
   'type': 'FLOAT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'day_of_week_xf',
   'type': 'INT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'day_xf',
   'type': 'INT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'holiday_xf',
   'type': 'INT',
   'intDomain': {'isCategorical': True},
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'hour_xf',
   'type': 'INT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'month_xf',
   'type': 'INT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'rain_1h_xf',
   'type': 'INT',
   'intDomain': {'isCategorical': True},
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'snow_1h_xf',
   'type': 'FLOAT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'temp_xf',
   'type': 'FLOAT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'traffic_volume_xf',
   'type': 'INT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'weather_description_xf',
   'type': 'INT',
   'intDomain': {'isCategorical': True},
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'weather_main_xf',
   'type': 'INT',
   'intDomain': {'isCategorical': True},
   'presence': {'minFraction': 1.0},
   'shape': {}}]}

Expected Output:

{'feature': [{'name': 'clouds_all_xf',
   'type': 'FLOAT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'day_of_week_xf',
   'type': 'INT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'day_xf',
   'type': 'INT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'holiday_xf',
   'type': 'INT',
   'intDomain': {'isCategorical': True},
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'hour_xf',
   'type': 'INT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'month_xf',
   'type': 'INT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'rain_1h_xf',
   'type': 'INT',
   'intDomain': {'isCategorical': True},
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'snow_1h_xf',
   'type': 'FLOAT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'temp_xf',
   'type': 'FLOAT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'traffic_volume_xf',
   'type': 'INT',
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'weather_description_xf',
   'type': 'INT',
   'intDomain': {'isCategorical': True},
   'presence': {'minFraction': 1.0},
   'shape': {}},
  {'name': 'weather_main_xf',
   'type': 'INT',
   'intDomain': {'isCategorical': True},
   'presence': {'minFraction': 1.0},
   'shape': {}}]}

Exercise 7

With the transform module defined, complete the code below to perform feature engineering on the raw data.

# grader-required-cell

### START CODE HERE
# Instantiate the Transform component
transform = tfx.components.Transform(
    examples=example_gen.outputs["examples"],
    schema=schema_gen.outputs["schema"],
    module_file=os.path.abspath(_traffic_transform_module_file)
)
    
    
    

# Run the component.
# The `enable_cache` flag is disabled in case you need to update your transform module file.
context.run(transform, enable_cache=False)
### END CODE HERE
WARNING:root:This output type hint will be ignored and not used for type-checking purposes. Typically, output type hints for a PTransform are single (or nested) types wrapped by a PCollection, PDone, or None. Got: Tuple[Dict[str, Union[NoneType, _Dataset]], Union[Dict[str, Dict[str, PCollection]], NoneType], int] instead.
WARNING:absl:Tables initialized inside a tf.function  will be re-initialized on every invocation of the function. This  re-initialization can have significant impact on performance. Consider lifting  them out of the graph context using  `tf.init_scope`.: compute_and_apply_vocabulary/apply_vocab/text_file_init/InitializeTableFromTextFileV2
WARNING:absl:Tables initialized inside a tf.function  will be re-initialized on every invocation of the function. This  re-initialization can have significant impact on performance. Consider lifting  them out of the graph context using  `tf.init_scope`.: compute_and_apply_vocabulary_1/apply_vocab/text_file_init/InitializeTableFromTextFileV2
WARNING:absl:Tables initialized inside a tf.function  will be re-initialized on every invocation of the function. This  re-initialization can have significant impact on performance. Consider lifting  them out of the graph context using  `tf.init_scope`.: compute_and_apply_vocabulary_2/apply_vocab/text_file_init/InitializeTableFromTextFileV2
WARNING:absl:Tables initialized inside a tf.function  will be re-initialized on every invocation of the function. This  re-initialization can have significant impact on performance. Consider lifting  them out of the graph context using  `tf.init_scope`.: compute_and_apply_vocabulary/apply_vocab/text_file_init/InitializeTableFromTextFileV2
WARNING:absl:Tables initialized inside a tf.function  will be re-initialized on every invocation of the function. This  re-initialization can have significant impact on performance. Consider lifting  them out of the graph context using  `tf.init_scope`.: compute_and_apply_vocabulary_1/apply_vocab/text_file_init/InitializeTableFromTextFileV2
WARNING:absl:Tables initialized inside a tf.function  will be re-initialized on every invocation of the function. This  re-initialization can have significant impact on performance. Consider lifting  them out of the graph context using  `tf.init_scope`.: compute_and_apply_vocabulary_2/apply_vocab/text_file_init/InitializeTableFromTextFileV2
WARNING:root:This output type hint will be ignored and not used for type-checking purposes. Typically, output type hints for a PTransform are single (or nested) types wrapped by a PCollection, PDone, or None. Got: Tuple[Dict[str, Union[NoneType, _Dataset]], Union[Dict[str, Dict[str, PCollection]], NoneType], int] instead.
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.
<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
ExecutionResult at 0x7f07601245b0
.execution_id10
.component<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
.component.inputs
['examples']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
['schema']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
.component.outputs
['transform_graph']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
['transformed_examples']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
['updated_analyzer_cache']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
['pre_transform_schema']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
['pre_transform_stats']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
['post_transform_schema']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
['post_transform_stats']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }
['post_transform_anomalies']<style> .tfx-object.expanded { padding: 4px 8px 4px 8px; background: white; border: 1px solid #bbbbbb; box-shadow: 4px 4px 2px rgba(0,0,0,0.05); } .tfx-object, .tfx-object * { font-size: 11pt; } .tfx-object > .title { cursor: pointer; } .tfx-object .expansion-marker { color: #999999; } .tfx-object.expanded > .title > .expansion-marker:before { content: '▼'; } .tfx-object.collapsed > .title > .expansion-marker:before { content: '▶'; } .tfx-object .class-name { font-weight: bold; } .tfx-object .deemphasize { opacity: 0.5; } .tfx-object.collapsed > table.attr-table { display: none; } .tfx-object.expanded > table.attr-table { display: block; } .tfx-object table.attr-table { border: 2px solid white; margin-top: 5px; } .tfx-object table.attr-table td.attr-name { vertical-align: top; font-weight: bold; } .tfx-object table.attr-table td.attrvalue { text-align: left; } <script> function toggleTfxObject(element) { var objElement = element.parentElement; if (objElement.classList.contains('collapsed')) { objElement.classList.remove('collapsed'); objElement.classList.add('expanded'); } else { objElement.classList.add('collapsed'); objElement.classList.remove('expanded'); } }

You should see the output cell by InteractiveContext above and see the three artifacts in .component.outputs:

The transform_graph artifact URI should point to a directory containing:

Again, for grading purposes, we inserted an except and else below to handle checking the output outside the notebook environment.

# grader-required-cell

try:
    # Get the uri of the transform graph
    transform_graph_uri = transform.outputs['transform_graph'].get()[0].uri

except IndexError:
    print("context.run() was no-op")
    transform_path = './pipeline/Transform/transformed_examples'
    dir_id = os.listdir(transform_path)[0]
    transform_graph_uri = f'{transform_path}/{dir_id}'
    
else:
    # List the subdirectories under the uri
    os.listdir(transform_graph_uri)

Lastly, you can also take a look at a few of the transformed examples.

# grader-required-cell

try:
    # Get the URI of the output artifact representing the transformed examples
    train_uri = os.path.join(transform.outputs['transformed_examples'].get()[0].uri, 'Split-train')
    
except IndexError:
    print("context.run() was no-op")
    train_uri = os.path.join(transform_graph_uri, 'Split-train')
# grader-required-cell

# Get the list of files in this directory (all compressed TFRecord files)
tfrecord_filenames = [os.path.join(train_uri, name)
                      for name in os.listdir(train_uri)]

# Create a `TFRecordDataset` to read these files
transformed_dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type="GZIP")
# grader-required-cell

# Get 3 records from the dataset
sample_records_xf = get_records(transformed_dataset, 3)

# Print the output
pp.pprint(sample_records_xf)
[{'features': {'feature': {'clouds_all_xf': {'floatList': {'value': [0.4]}},
                           'day_of_week_xf': {'int64List': {'value': ['1']}},
                           'day_xf': {'int64List': {'value': ['2']}},
                           'holiday_xf': {'int64List': {'value': ['0']}},
                           'hour_xf': {'int64List': {'value': ['9']}},
                           'month_xf': {'int64List': {'value': ['10']}},
                           'rain_1h_xf': {'int64List': {'value': ['2']}},
                           'snow_1h_xf': {'floatList': {'value': [-0.027424404]}},
                           'temp_xf': {'floatList': {'value': [0.53368515]}},
                           'traffic_volume_xf': {'int64List': {'value': ['1']}},
                           'weather_description_xf': {'int64List': {'value': ['4']}},
                           'weather_main_xf': {'int64List': {'value': ['0']}}}}},
 {'features': {'feature': {'clouds_all_xf': {'floatList': {'value': [0.75]}},
                           'day_of_week_xf': {'int64List': {'value': ['1']}},
                           'day_xf': {'int64List': {'value': ['2']}},
                           'holiday_xf': {'int64List': {'value': ['0']}},
                           'hour_xf': {'int64List': {'value': ['10']}},
                           'month_xf': {'int64List': {'value': ['10']}},
                           'rain_1h_xf': {'int64List': {'value': ['2']}},
                           'snow_1h_xf': {'floatList': {'value': [-0.027424404]}},
                           'temp_xf': {'floatList': {'value': [0.6156976]}},
                           'traffic_volume_xf': {'int64List': {'value': ['1']}},
                           'weather_description_xf': {'int64List': {'value': ['3']}},
                           'weather_main_xf': {'int64List': {'value': ['0']}}}}},
 {'features': {'feature': {'clouds_all_xf': {'floatList': {'value': [0.9]}},
                           'day_of_week_xf': {'int64List': {'value': ['1']}},
                           'day_xf': {'int64List': {'value': ['2']}},
                           'holiday_xf': {'int64List': {'value': ['0']}},
                           'hour_xf': {'int64List': {'value': ['11']}},
                           'month_xf': {'int64List': {'value': ['10']}},
                           'rain_1h_xf': {'int64List': {'value': ['2']}},
                           'snow_1h_xf': {'floatList': {'value': [-0.027424404]}},
                           'temp_xf': {'floatList': {'value': [0.6324042]}},
                           'traffic_volume_xf': {'int64List': {'value': ['1']}},
                           'weather_description_xf': {'int64List': {'value': ['2']}},
                           'weather_main_xf': {'int64List': {'value': ['0']}}}}}]

Congratulations on completing this week’s assignment! You’ve just demonstrated how to build a data pipeline and do feature engineering. You will build upon these concepts in the next weeks where you will deal with more complex datasets and also access the metadata store. Keep up the good work!

Please click here if you want to experiment with any of the non-graded code.

Important Note: Please only do this when you've already passed the assignment to avoid problems with the autograder.

  1. On the notebook’s menu, click “View” > “Cell Toolbar” > “Edit Metadata”
  2. Hit the “Edit Metadata” button next to the code cell which you want to lock/unlock
  3. Set the attribute value for “editable” to:
    • “true” if you want to unlock it
    • “false” if you want to lock it
  4. On the notebook’s menu, click “View” > “Cell Toolbar” > “None”

Here's a short demo of how to do the steps above: