In this lab, you will revisit the News Headlines Dataset for Sarcasm Detection from last week and proceed to build and train a model on it. The steps will be very similar to the previous lab with IMDB Reviews, with just some minor modifications. You can tweak the hyperparameters and see how they affect the results. Let’s begin!
You will first download the JSON file, load it into your workspace and put the sentences and labels into lists.
# Download the dataset
!wget https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json
import json
# Load the JSON file
with open("./sarcasm.json", 'r') as f:
    datastore = json.load(f)
# Initialize the lists
sentences = []
labels = []
# Collect sentences and labels into the lists
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
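As a quick sanity check (optional, not part of the original lab), you can print how many examples were loaded and inspect one of them:
# Optional check: number of examples and a sample entry
print(f'number of examples: {len(sentences)}')
print(f'sample headline: {sentences[0]}')
print(f'sample label: {labels[0]}')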
We placed the hyperparameters in the cell below so you can easily tweak them later:
# Number of examples to use for training
training_size = 20000
# Vocabulary size of the tokenizer
vocab_size = 10000
# Maximum length of the padded sequences
max_length = 32
# Output dimensions of the Embedding layer
embedding_dim = 16
Next, you will generate your train and test datasets. You will use the training_size value you set above to slice the sentences and labels lists into two sublists: one for training and another for testing.
# Split the sentences
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
# Split the labels
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]
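You can verify the split with a quick print (optional, not part of the original lab); the two sublists should add up to the full dataset:
# Optional check: sizes of the train and test splits
print(f'training examples: {len(training_sentences)}')
print(f'testing examples: {len(testing_sentences)}')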
Now you can preprocess the text and labels so they can be consumed by the model. You will use the Tokenizer class to create the vocabulary and the pad_sequences method to generate padded token sequences. You will also need to convert the labels lists to numpy arrays so they are a valid data type for model.fit().
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Parameters for padding and OOV tokens
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"
# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
# Generate the word index dictionary
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
# Generate and pad the training sequences
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
# Generate and pad the testing sequences
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
# Convert the labels lists into numpy arrays
training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)
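If you want to confirm the preprocessing worked as expected, you can inspect the resulting shapes (an optional check, not part of the original lab):
# Optional check: padded sequences should have shape (num_examples, max_length)
print(f'shape of training_padded: {training_padded.shape}')
print(f'shape of testing_padded: {testing_padded.shape}')
print(f'shape of training_labels: {training_labels.shape}')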
Next, you will build the model. The architecture is similar to the previous lab, but you will use a GlobalAveragePooling1D layer instead of Flatten after the Embedding. This layer averages over the sequence dimension before the result is passed to the dense layers. See a short demo of how this works in the snippet below. Notice that it averages the 3 vectors element-wise (i.e. (10 + 1 + 1) / 3 and (2 + 3 + 1) / 3) to arrive at the final output of [4, 2].
import tensorflow as tf
# Initialize a GlobalAveragePooling1D (GAP1D) layer
gap1d_layer = tf.keras.layers.GlobalAveragePooling1D()
# Define sample array
sample_array = np.array([[[10,2],[1,3],[1,1]]])
# Print shape and contents of sample array
print(f'shape of sample_array = {sample_array.shape}')
print(f'sample array: {sample_array}')
# Pass the sample array to the GAP1D layer
output = gap1d_layer(sample_array)
# Print shape and contents of the GAP1D output array
print(f'output shape of gap1d_layer: {output.shape}')
print(f'output array of gap1d_layer: {output.numpy()}')
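If you want to cross-check the result, the same averaging can be computed directly with NumPy over the time (second) axis. This comparison is optional and not part of the original lab:
# Optional cross-check: GlobalAveragePooling1D averages over the time axis,
# so np.mean with axis=1 should give the same values as the layer output
print(np.mean(sample_array, axis=1))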
This added computation reduces the dimensionality of the model compared to using Flatten(), and thus the number of trainable parameters will also decrease. See the output of model.summary() below and compare it with what you get if you swap out the pooling layer for a simple Flatten().
# Build the model
model = tf.keras.Sequential([
tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
tf.keras.layers.GlobalAveragePooling1D(),
tf.keras.layers.Dense(24, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
# Print the model summary
model.summary()
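To see the difference in parameter counts for yourself, you could build an otherwise identical model that uses Flatten() and compare its summary. This variant is only a sketch for comparison and is not used in the rest of the lab:
# For comparison only: same architecture but with Flatten instead of pooling.
# The Dense(24) layer now connects to max_length * embedding_dim inputs,
# so the parameter count is noticeably larger.
model_flatten = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Print the summary of the Flatten-based variant
model_flatten.summary()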
You will use the same loss, optimizer, and metrics from the previous lab.
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Now you will feed in the prepared datasets to train the model. If you used the default hyperparameters, you will get around 99% training accuracy and 80% validation accuracy.
Tip: You can set the verbose parameter of model.fit() to 2 to indicate that you want to print just the results per epoch. Setting it to 1 (default) displays a progress bar per epoch, while 0 silences all displays. It doesn’t matter much in this Colab, but when working in a production environment, you may want to set this to 2 as recommended in the documentation.
num_epochs = 30
# Train the model
history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels), verbose=2)
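Optionally (not part of the original lab), you can evaluate the trained model on the test split to get the final loss and accuracy as single numbers:
# Optional: final evaluation on the test split
loss, accuracy = model.evaluate(testing_padded, testing_labels, verbose=2)
print(f'test loss: {loss:.4f}, test accuracy: {accuracy:.4f}')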
You can use the cell below to plot the training results. You may notice some overfitting because your validation accuracy is slowly dropping while the training accuracy is still going up. See if you can improve it by tweaking the hyperparameters. Some example values are shown in the lectures.
import matplotlib.pyplot as plt
# Plot utility
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()
# Plot the accuracy and loss
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")
As before, you can visualize the final weights of the embeddings using the TensorFlow Embedding Projector (https://projector.tensorflow.org).
# Get the index-word dictionary
reverse_word_index = tokenizer.index_word
# Get the embedding layer from the model (i.e. first layer)
embedding_layer = model.layers[0]
# Get the weights of the embedding layer
embedding_weights = embedding_layer.get_weights()[0]
# Print the shape. Expected is (vocab_size, embedding_dim)
print(embedding_weights.shape)
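As an optional check (not in the original lab), you can look up the embedding vector of an individual word using its index in the vocabulary:
# Optional: inspect the embedding vector of the most frequent word
# (index 0 is reserved for padding and index 1 for the OOV token,
# so the first real word is at index 2)
sample_word = reverse_word_index[2]
print(f'word: {sample_word}')
print(f'embedding vector: {embedding_weights[2]}')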
import io
# Open writeable files
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
# Initialize the loop. Start counting at `1` because `0` is just for the padding
for word_num in range(1, vocab_size):
    # Get the word associated at the current index
    word_name = reverse_word_index[word_num]
    # Get the embedding weights associated with the current index
    word_embedding = embedding_weights[word_num]
    # Write the word name
    out_m.write(word_name + "\n")
    # Write the word embedding
    out_v.write('\t'.join([str(x) for x in word_embedding]) + "\n")
# Close the files
out_v.close()
out_m.close()
# Import files utilities in Colab
try:
    from google.colab import files
except ImportError:
    pass
# Download the files
else:
    files.download('vecs.tsv')
    files.download('meta.tsv')
In this lab, you were able to build a binary classifier to detect sarcasm. You saw some overfitting in the initial attempt and hopefully, you were able to arrive at a better set of hyperparameters.
So far, you’ve been tokenizing datasets from scratch and treating the vocab size as a hyperparameter. Furthermore, you’ve been tokenizing the texts by building a vocabulary of full words. In the next lab, you will make use of a pre-tokenized dataset that uses a vocabulary of subwords. For instance, instead of having a unique token for the word Tensorflow, it will instead have a token each for Ten, sor, and flow. You will see the motivation and implications of this design in the next exercise. See you there!