Coursera

Week 4

Tuning and Performance Improvements in your Pipeline

Boosting the tf.data pipeline

ETL: Extract, Transform, Load — the three phases of a data pipeline. Extract reads raw data from storage, Transform preprocesses it (decoding, augmentation, shuffling, batching), and Load delivers the prepared batches to the accelerator for training.
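The three phases can be sketched as a minimal tf.data pipeline. This is an illustrative stand-in: the in-memory range dataset substitutes for real file extraction, and preprocess is a hypothetical transform.

```python
import tensorflow as tf

# Extract: read raw examples (an in-memory stand-in for files or TFRecords)
raw = tf.data.Dataset.from_tensor_slices(tf.range(10))

# Transform: preprocess each element, then group into batches
def preprocess(x):
    return tf.cast(x, tf.float32) / 10.0

pipeline = raw.map(preprocess).batch(5)

# Load: prefetch so the next batch is prepared while the current one is consumed
pipeline = pipeline.prefetch(1)

batches = [b.numpy().tolist() for b in pipeline]
```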

What happens when you train a model? The CPU extracts and transforms data while the GPU/TPU runs the training step; if the data pipeline is slow, the accelerator sits idle waiting for the next batch.

Data and its problems: the dataset may not fit in memory, preprocessing can be CPU-bound, and I/O from disk or remote storage can starve the accelerator.

Caching

# Cache in memory after the first full pass over the data:
tf.data.Dataset.cache()
# Or cache to a file on disk when the dataset is too large for memory:
tf.data.Dataset.cache(filename=...)
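A small sketch of the in-memory cache in action. slow_transform is a hypothetical expensive transformation, wrapped in tf.py_function so we can count from Python how often it actually runs:

```python
import tensorflow as tf

call_count = {"n": 0}

def slow_transform(x):
    # Stand-in for an expensive transformation; counts how often it really runs
    call_count["n"] += 1
    return x * 2

dataset = (tf.data.Dataset.range(5)
           .map(lambda x: tf.py_function(slow_transform, [x], tf.int64))
           .cache())

epoch1 = [int(v) for v in dataset]  # first pass computes and fills the cache
epoch2 = [int(v) for v in dataset]  # second pass is served from the cache
```

Because the second pass reads from the cache, the transform runs only once per element across both epochs.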

Parallelism APIs

import multiprocessing

# Run the augment function on as many threads as there are CPU cores
num_cores = multiprocessing.cpu_count()

augmented_dataset = dataset.map(augment, num_parallel_calls=num_cores)
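A runnable sketch of the same pattern, using dummy images and a stand-in augment function (a random horizontal flip), since the course's dataset and augment are not shown here:

```python
import multiprocessing
import tensorflow as tf

num_cores = multiprocessing.cpu_count()

def augment(image):
    # Stand-in augmentation: random horizontal flip
    return tf.image.random_flip_left_right(image)

# Eight dummy 32x32 RGB images in place of a real image dataset
images = tf.data.Dataset.from_tensor_slices(tf.zeros([8, 32, 32, 3]))

# Up to num_cores elements are transformed concurrently
augmented_dataset = images.map(augment, num_parallel_calls=num_cores)

first = next(iter(augmented_dataset))
```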

Autotuning

import tensorflow as tf
import tensorflow_datasets as tfds

dataset = tfds.load("cats_vs_dogs", split=tfds.Split.TRAIN)

# AUTOTUNE lets tf.data pick the prefetch buffer size at runtime
train_dataset = dataset.map(format_image).prefetch(tf.data.experimental.AUTOTUNE)
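The same pattern can be tried without the cats_vs_dogs download. Here dummy tensors stand in for the dataset, and format_image is a hypothetical resize-and-rescale transform; AUTOTUNE is also passed to map so parallelism is tuned as well:

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def format_image(image):
    # Stand-in for the course's format_image: resize and rescale to [0, 1]
    image = tf.image.resize(image, (224, 224))
    return image / 255.0

# Four dummy 300x300 images instead of the cats_vs_dogs download
dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([4, 300, 300, 3]))

# AUTOTUNE lets tf.data pick the parallelism and prefetch buffer at runtime
train_dataset = (dataset
                 .map(format_image, num_parallel_calls=AUTOTUNE)
                 .prefetch(AUTOTUNE))

example = next(iter(train_dataset))
```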

Parallelizing Data Extraction

Use files.interleave — i.e., tf.data.Dataset.interleave applied to a dataset of file names — to read from several files in parallel and mix their records during extraction.
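A self-contained sketch: three small text files written to a temp directory stand in for data shards, and TextLineDataset stands in for the record reader.

```python
import os
import tempfile
import tensorflow as tf

# Write three small text files to stand in for data shards on disk
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmpdir, "shard%d.txt" % i)
    with open(path, "w") as f:
        f.write("line-a-%d\nline-b-%d\n" % (i, i))
    paths.append(path)

files = tf.data.Dataset.from_tensor_slices(paths)

# interleave opens cycle_length files at once and mixes their records;
# num_parallel_calls parallelizes the reads themselves
dataset = files.interleave(
    tf.data.TextLineDataset,
    cycle_length=3,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

lines = [l.numpy().decode() for l in dataset]
```

With cycle_length=3 the reader cycles through all three shards, so records from different files are interleaved rather than read file by file.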

Best practices for code improvements

Reduce per-element overhead in map by vectorizing: batch first so func is invoked once per batch instead of once per element (func must then operate on batched tensors):

dataset = dataset.batch(BATCH_SIZE).map(func)

or let tf.data attempt the vectorization automatically via an experimental optimization option:

options = tf.data.Options()
options.experimental_optimization.map_vectorization.enabled = True
dataset = dataset.with_options(options)
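A sketch comparing the two orderings, assuming a simple element-wise func (hypothetical here) that works the same on scalars and batches — both produce identical results, but the batch-first version calls func far fewer times:

```python
import tensorflow as tf

BATCH_SIZE = 4

def func(x):
    # Element-wise math works identically on scalars and on whole batches
    return x * 2 + 1

data = tf.data.Dataset.range(8)

# Mapping per element pays scheduling overhead for every single element...
per_element = data.map(func).batch(BATCH_SIZE)
# ...while batching first invokes func once per batch of BATCH_SIZE elements
vectorized = data.batch(BATCH_SIZE).map(func)

out_a = [b.numpy().tolist() for b in per_element]
out_b = [b.numpy().tolist() for b in vectorized]
```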

Cache

Cache the result of an expensive transformation so it is computed only once and reused on every subsequent epoch:

transformed_dataset = dataset.map(transform).cache()

Ordering repeat before shuffle (dataset.repeat().shuffle(...)) gives better performance but lets elements cross epoch boundaries, while shuffle before repeat preserves clean epoch boundaries at some cost.
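A small sketch of the two orderings on a toy dataset (the seed is fixed only to make the example reproducible):

```python
import tensorflow as tf

data = tf.data.Dataset.range(6)

# shuffle before repeat: each epoch is a complete pass, reshuffled per epoch
epoch_safe = data.shuffle(6, seed=1).repeat(2)

# repeat before shuffle: cheaper, but elements may cross epoch boundaries,
# so a single "epoch" can contain duplicates and omissions
fast = data.repeat(2).shuffle(6, seed=1)

safe_vals = [int(v) for v in epoch_safe]
fast_vals = [int(v) for v in fast]
```

Both orderings eventually yield every element twice; only the shuffle-then-repeat version guarantees that the first six elements form one complete epoch.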