Boosting the tf.data pipeline
ETL: Extract, Transform, Load
Data and its problems
tf.data.Dataset.cache() — caches the dataset's elements in memory
tf.data.Dataset.cache(filename=...) — caches the elements on disk instead
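A minimal sketch of both cache variants, using a toy range dataset in place of real data (the doubling map stands in for an expensive preprocessing step; the cache path is hypothetical):

```python
import tensorflow as tf

# Toy dataset; the map stands in for expensive preprocessing.
dataset = tf.data.Dataset.range(5).map(lambda x: x * 2)

# In-memory cache: the first pass computes the map, later passes
# reuse the cached elements instead of recomputing them.
cached = dataset.cache()

# File-backed cache: for datasets that do not fit in RAM, elements
# are written to cache files at the given path on the first pass.
file_cached = dataset.cache(filename="/tmp/tfdata_cache")

print([int(x) for x in cached.as_numpy_iterator()])  # [0, 2, 4, 6, 8]
```

Note that cache() remembers everything before it in the pipeline, so it belongs after expensive deterministic transformations and before random ones such as augmentation.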
import multiprocessing
num_cores = multiprocessing.cpu_count()
augmented_dataset = dataset.map(augment, num_parallel_calls=num_cores)
# tf.data.AUTOTUNE also works here and lets tf.data pick the parallelism
import tensorflow as tf
import tensorflow_datasets as tfds
dataset = tfds.load("cats_vs_dogs", split=tfds.Split.TRAIN)
train_dataset = dataset.map(format_image).prefetch(tf.data.experimental.AUTOTUNE)
Using tf.data.Dataset.interleave over a dataset of filenames
to parallelize the data extraction process.
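As a sketch, interleave can be demonstrated with small in-memory datasets standing in for files; in a real pipeline `file_ids` would be a dataset of filenames and the lambda would open each shard, e.g. with tf.data.TFRecordDataset:

```python
import tensorflow as tf

# Stand-ins for three input files.
file_ids = tf.data.Dataset.range(1, 4)

dataset = file_ids.interleave(
    # Each "file" yields its id three times; with real files this
    # would be e.g. lambda f: tf.data.TFRecordDataset(f).
    lambda i: tf.data.Dataset.from_tensors(i).repeat(3),
    cycle_length=2,                        # read two sources at a time
    num_parallel_calls=tf.data.AUTOTUNE,   # parallelize the reads
)

print([int(x) for x in dataset.as_numpy_iterator()])
```

Elements from the open sources are mixed together, so extraction from multiple files overlaps instead of running file by file.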
Reduce per-element overhead in map by vectorizing: batch first so func is invoked once per batch instead of once per element:
dataset = dataset.batch(BATCH_SIZE).map(func)
or enable automatic map vectorization:
options = tf.data.Options()
options.experimental_optimization.map_vectorization.enabled = True
dataset = dataset.with_options(options)
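The effect of batching before mapping can be seen with an elementwise function (a toy doubling here, standing in for func): both orderings produce the same batches, but the batch-first version invokes the function far fewer times.

```python
import tensorflow as tf

BATCH_SIZE = 4

def scale(x):
    # Elementwise, so it works on one example or a whole batch.
    return x * 2

dataset = tf.data.Dataset.range(8)

# Map per element: `scale` runs once for each of the 8 examples.
per_element = dataset.map(scale).batch(BATCH_SIZE)

# Batch first: `scale` runs once per batch (twice in total),
# amortizing the per-call overhead.
per_batch = dataset.batch(BATCH_SIZE).map(scale)

for a, b in zip(per_element, per_batch):
    print(a.numpy(), b.numpy())  # the two pipelines yield identical batches
```

This only works when the mapped function is itself vectorized, i.e. applies correctly to a batched tensor.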
Cache the transformed dataset so the map transformation runs only once:
transformed_dataset = dataset.map(transform).cache()
Chaining shuffle and repeat (shuffle first, so each repetition is a full pass over the data)
for better performance
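A sketch of the usual ordering, shuffle before repeat, so that every element of an epoch is seen before the next epoch starts:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(6)

# shuffle before repeat: each repetition is a full, freshly
# shuffled pass over the data (clean epoch boundaries).
train = dataset.shuffle(buffer_size=6).repeat(2).batch(3)

for batch in train:
    print(batch.numpy())
```

Repeating before shuffling would instead mix elements across epoch boundaries, so some examples could appear twice before others appear once.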