Boosting the tf.data pipeline
ETL: Extract, Transform, Load
Data and its problems
tf.data.Dataset.cache() — caches the dataset's elements in memory
tf.data.Dataset.cache(filename=...) — caches the elements on disk instead
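A minimal sketch of both cache variants, using a toy range dataset in place of real data (the doubling map stands in for an expensive preprocessing step; the cache path is hypothetical):

```python
import tensorflow as tf

# Toy dataset; the map stands in for expensive preprocessing.
dataset = tf.data.Dataset.range(5).map(lambda x: x * 2)

# In-memory cache: the first pass computes the map, later passes
# reuse the cached elements instead of recomputing them.
cached = dataset.cache()

# File-backed cache: for datasets that do not fit in RAM, elements
# are written to cache files at the given path on the first pass.
file_cached = dataset.cache(filename="/tmp/tfdata_cache")

print([int(x) for x in cached.as_numpy_iterator()])  # [0, 2, 4, 6, 8]
```

Note that cache() remembers everything before it in the pipeline, so it belongs after expensive deterministic transformations and before random ones such as augmentation.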
import multiprocessing
num_cores = multiprocessing.cpu_count()
augmented_dataset = dataset.map(augment, num_parallel_calls=num_cores)
# tf.data.AUTOTUNE also works here and lets tf.data pick the parallelism
import tensorflow as tf
import tensorflow_datasets as tfds
dataset = tfds.load("cats_vs_dogs", split=tfds.Split.TRAIN)
train_dataset = dataset.map(format_image).prefetch(tf.data.experimental.AUTOTUNE)
Using tf.data.Dataset.interleave over a dataset of filenames
to parallelize the data extraction process.
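As a sketch, interleave can be demonstrated with small in-memory datasets standing in for files; in a real pipeline `file_ids` would be a dataset of filenames and the lambda would open each shard, e.g. with tf.data.TFRecordDataset:

```python
import tensorflow as tf

# Stand-ins for three input files.
file_ids = tf.data.Dataset.range(1, 4)

dataset = file_ids.interleave(
    # Each "file" yields its id three times; with real files this
    # would be e.g. lambda f: tf.data.TFRecordDataset(f).
    lambda i: tf.data.Dataset.from_tensors(i).repeat(3),
    cycle_length=2,                        # read two sources at a time
    num_parallel_calls=tf.data.AUTOTUNE,   # parallelize the reads
)

print([int(x) for x in dataset.as_numpy_iterator()])
```

Elements from the open sources are mixed together, so extraction from multiple files overlaps instead of running file by file.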
Reduce per-element overhead in map by vectorizing: batch first so func is invoked once per batch instead of once per element:
dataset = dataset.batch(BATCH_SIZE).map(func)
or enable automatic map vectorization:
options = tf.data.Options()
options.experimental_optimization.map_vectorization.enabled = True
dataset = dataset.with_options(options)
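The effect of batching before mapping can be seen with an elementwise function (a toy doubling here, standing in for func): both orderings produce the same batches, but the batch-first version invokes the function far fewer times.

```python
import tensorflow as tf

BATCH_SIZE = 4

def scale(x):
    # Elementwise, so it works on one example or a whole batch.
    return x * 2

dataset = tf.data.Dataset.range(8)

# Map per element: `scale` runs once for each of the 8 examples.
per_element = dataset.map(scale).batch(BATCH_SIZE)

# Batch first: `scale` runs once per batch (twice in total),
# amortizing the per-call overhead.
per_batch = dataset.batch(BATCH_SIZE).map(scale)

for a, b in zip(per_element, per_batch):
    print(a.numpy(), b.numpy())  # the two pipelines yield identical batches
```

This only works when the mapped function is itself vectorized, i.e. applies correctly to a batched tensor.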
Cache the transformed dataset so the map transformation runs only once:
transformed_dataset = dataset.map(transform).cache()
Chaining shuffle and repeat (shuffle first, so each repetition is a full pass over the data)
for better performance
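A sketch of the usual ordering, shuffle before repeat, so that every element of an epoch is seen before the next epoch starts:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(6)

# shuffle before repeat: each repetition is a full, freshly
# shuffled pass over the data (clean epoch boundaries).
train = dataset.shuffle(buffer_size=6).repeat(2).batch(3)

for batch in train:
    print(batch.numpy())
```

Repeating before shuffling would instead mix elements across epoch boundaries, so some examples could appear twice before others appear once.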