
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Text Classification

In this notebook we will classify movie reviews as either positive or negative. We’ll use the IMDB dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning each contains an equal number of positive and negative reviews.

Setup

try:
    # Select TensorFlow 2.x when running on Colab (a no-op if the magic is unavailable).
    %tensorflow_version 2.x
except:
    pass
Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

print("\u2022 Using TensorFlow Version:", tf.__version__)
• Using TensorFlow Version: 2.14.0

Download the IMDB Dataset

We will download the IMDB dataset using TensorFlow Datasets. We will use a training set, a validation set, and a test set. Since the IMDB dataset doesn’t have a validation split, we will use the first 60% of the training set for training, and the last 40% of the training set for validation.

splits = ['train[:60%]', 'train[-40%:]', 'test']

splits, info = tfds.load(name="imdb_reviews", with_info=True, split=splits, as_supervised=True)

train_data, validation_data, test_data = splits
Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.

Explore the Data

Let’s take a moment to look at the data.

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples
num_classes = info.features['label'].num_classes

print('The Dataset has a total of:')
print('\u2022 {:,} classes'.format(num_classes))

print('\u2022 {:,} movie reviews for training'.format(num_train_examples))
print('\u2022 {:,} movie reviews for testing'.format(num_test_examples))
The Dataset has a total of:
• 2 classes
• 25,000 movie reviews for training
• 25,000 movie reviews for testing

The labels are either 0 or 1, where 0 is a negative review, and 1 is a positive review. We will create a list with the corresponding class names, so that we can map labels to class names later on.

class_names = ['negative', 'positive']

Each example consists of a sentence representing the movie review and a corresponding label. The sentence is not preprocessed in any way. Let’s take a look at the first example of the training set.

for review, label in train_data.take(1):
    review = review.numpy()
    label = label.numpy()

    print('\nMovie Review:\n\n', review)
    print('\nLabel:', class_names[label])
Movie Review:

 b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."

Label: negative
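
If you want to confirm the class balance mentioned earlier, a short loop over the training slice can count the labels. This is just an illustrative sketch; it iterates over the full 60% training slice, so it takes a few seconds, and the exact counts depend on how that slice falls.

from collections import Counter

# Count how many examples of each label are in our training slice.
label_counts = Counter(int(label.numpy()) for _, label in train_data)
print('Training label counts:',
      {class_names[k]: v for k, v in label_counts.items()})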

Load Word Embeddings

In this example, the input data consists of sentences. The labels to predict are either 0 or 1.

One way to represent the text is to convert the sentences into word embeddings. Word embeddings are an efficient way to represent words using dense vectors, where semantically similar words have similar vectors. We can use a pre-trained text embedding as the first layer of our model, which has two advantages: we don’t have to worry about text preprocessing, and we can benefit from transfer learning.

For this example we will use a model from TensorFlow Hub called google/tf2-preview/gnews-swivel-20dim/1. We’ll create a hub.KerasLayer that uses the TensorFlow Hub model to embed the sentences. We can choose to fine-tune the TF Hub module weights during training by setting the trainable parameter to True.

# if you are running the notebook on Colab
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"

# # if you are running the notebook on your local machine
# embedding = "./models/tf2-preview_gnews-swivel-20dim_1"

hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)
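
Before building the full model, we can call the layer on a couple of made-up sentences to see what it returns: one fixed-size vector per sentence (20 dimensions for this module), regardless of sentence length. The sentences below are invented purely for illustration.

# Each string is mapped to a single 20-dimensional embedding vector.
sample_sentences = tf.constant(["A wonderful, heartfelt film.",
                                "Dull and far too long."])
print(hub_layer(sample_sentences).shape)  # expected: (2, 20)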

Build Pipeline

batch_size = 512

train_batches = train_data.shuffle(num_train_examples // 4).batch(batch_size).prefetch(1)
validation_batches = validation_data.batch(batch_size).prefetch(1)
test_batches = test_data.batch(batch_size)
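
As a quick sanity check, we can pull a single batch out of the pipeline: the reviews come through as a batch of raw strings and the labels as a batch of 0/1 integers.

# Inspect the shapes and dtypes of one training batch.
for review_batch, label_batch in train_batches.take(1):
    print('Reviews:', review_batch.shape, review_batch.dtype)
    print('Labels: ', label_batch.shape, label_batch.dtype)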

Build the Model

In the code below we will build a Keras Sequential model with the following layers:

  1. The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence into its embedding vector. The model that we are using (google/tf2-preview/gnews-swivel-20dim/1) splits the sentence into tokens, embeds each token, and then combines the embeddings. The resulting dimensions are: (num_examples, embedding_dimension).

  2. This fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units.

  3. The last layer is densely connected with a single output node. Using the sigmoid activation function, this value is a float between 0 and 1, representing a probability, or confidence level.

model = tf.keras.Sequential([
        hub_layer,
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')])
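
We can call model.summary() to check the architecture and the parameter counts. The bulk of the trainable parameters come from the embedding module, since we set trainable=True.

# Print the layer structure and number of trainable parameters.
model.summary()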

Train the Model

Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), we’ll use the binary_crossentropy loss function.

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_batches,
                    epochs=20,
                    validation_data=validation_batches)
Epoch 1/20
30/30 [==============================] - 14s 269ms/step - loss: 0.7074 - accuracy: 0.5796 - val_loss: 0.6596 - val_accuracy: 0.6237
Epoch 2/20
30/30 [==============================] - 6s 199ms/step - loss: 0.6324 - accuracy: 0.6447 - val_loss: 0.6103 - val_accuracy: 0.6688
Epoch 3/20
30/30 [==============================] - 7s 224ms/step - loss: 0.5869 - accuracy: 0.6945 - val_loss: 0.5724 - val_accuracy: 0.7095
Epoch 4/20
30/30 [==============================] - 6s 194ms/step - loss: 0.5475 - accuracy: 0.7323 - val_loss: 0.5370 - val_accuracy: 0.7394
Epoch 5/20
30/30 [==============================] - 7s 218ms/step - loss: 0.5092 - accuracy: 0.7657 - val_loss: 0.5032 - val_accuracy: 0.7677
Epoch 6/20
30/30 [==============================] - 5s 176ms/step - loss: 0.4705 - accuracy: 0.7935 - val_loss: 0.4700 - val_accuracy: 0.7876
Epoch 7/20
30/30 [==============================] - 6s 177ms/step - loss: 0.4325 - accuracy: 0.8181 - val_loss: 0.4402 - val_accuracy: 0.8037
Epoch 8/20
30/30 [==============================] - 5s 177ms/step - loss: 0.3963 - accuracy: 0.8345 - val_loss: 0.4156 - val_accuracy: 0.8160
Epoch 9/20
30/30 [==============================] - 6s 210ms/step - loss: 0.3639 - accuracy: 0.8507 - val_loss: 0.3894 - val_accuracy: 0.8287
Epoch 10/20
30/30 [==============================] - 5s 156ms/step - loss: 0.3353 - accuracy: 0.8659 - val_loss: 0.3700 - val_accuracy: 0.8374
Epoch 11/20
30/30 [==============================] - 6s 211ms/step - loss: 0.3065 - accuracy: 0.8806 - val_loss: 0.3556 - val_accuracy: 0.8447
Epoch 12/20
30/30 [==============================] - 4s 134ms/step - loss: 0.2832 - accuracy: 0.8913 - val_loss: 0.3417 - val_accuracy: 0.8518
Epoch 13/20
30/30 [==============================] - 5s 179ms/step - loss: 0.2616 - accuracy: 0.9009 - val_loss: 0.3308 - val_accuracy: 0.8560
Epoch 14/20
30/30 [==============================] - 4s 142ms/step - loss: 0.2420 - accuracy: 0.9103 - val_loss: 0.3228 - val_accuracy: 0.8596
Epoch 15/20
30/30 [==============================] - 3s 112ms/step - loss: 0.2259 - accuracy: 0.9177 - val_loss: 0.3157 - val_accuracy: 0.8627
Epoch 16/20
30/30 [==============================] - 5s 160ms/step - loss: 0.2091 - accuracy: 0.9259 - val_loss: 0.3108 - val_accuracy: 0.8655
Epoch 17/20
30/30 [==============================] - 5s 149ms/step - loss: 0.1946 - accuracy: 0.9319 - val_loss: 0.3078 - val_accuracy: 0.8673
Epoch 18/20
30/30 [==============================] - 4s 129ms/step - loss: 0.1816 - accuracy: 0.9379 - val_loss: 0.3057 - val_accuracy: 0.8700
Epoch 19/20
30/30 [==============================] - 5s 149ms/step - loss: 0.1700 - accuracy: 0.9434 - val_loss: 0.3043 - val_accuracy: 0.8719
Epoch 20/20
30/30 [==============================] - 3s 101ms/step - loss: 0.1588 - accuracy: 0.9479 - val_loss: 0.3042 - val_accuracy: 0.8710
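
The history object returned by model.fit() records the loss and accuracy for each epoch. A few lines of matplotlib, sketched below, are enough to plot the training and validation loss curves and spot where the two start to diverge.

import matplotlib.pyplot as plt

history_dict = history.history
epochs = range(1, len(history_dict['loss']) + 1)

# Plot training vs. validation loss per epoch.
plt.plot(epochs, history_dict['loss'], 'b-', label='Training loss')
plt.plot(epochs, history_dict['val_loss'], 'r-', label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()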

Evaluate the Model

We will now see how well our model performs on the testing set.

eval_results = model.evaluate(test_batches, verbose=0)

for metric, value in zip(model.metrics_names, eval_results):
    print(metric + ': {:.3}'.format(value))
loss: 0.321
accuracy: 0.864
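
Finally, the trained model can be used on raw text directly, since the hub layer handles tokenization internally. The sketch below, using two invented reviews, converts each sigmoid output into a class name with a 0.5 threshold.

# Two made-up reviews, purely for illustration.
sample_reviews = tf.constant(["An absolute delight from start to finish.",
                              "I walked out halfway through."])

# The model outputs one probability per review.
probabilities = model.predict(sample_reviews)

for review, prob in zip(sample_reviews.numpy(), probabilities):
    predicted_class = class_names[int(prob[0] > 0.5)]
    print('{} -> {} ({:.2f})'.format(review.decode(), predicted_class, prob[0]))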