In the previous Hugging Face ungraded lab, you saw how to use pipeline objects to apply transformer models to NLP tasks. I showed you that the model didn't output the desired answers to a series of precise questions for a context related to the history of comic books.
In this lab, you will fine-tune the model from that lab so it gives better answers for that type of context. To do that, you will use the TyDi QA dataset, but a filtered version containing only English examples. Additionally, you will use many of the tools that Hugging Face has to offer.
Note that, in general, you will fine-tune general-purpose transformer models to work on specific tasks. However, fine-tuning a general-purpose model can take a lot of time. That's why you will be using the model from the question answering pipeline in this lab.
First, let's install some packages that you will use during the lab.
!pip install transformers datasets torch;
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Requirement already satisfied: torch in /usr/local/lib/python3.7/dist-packages (1.12.0+cu113)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.7/dist-packages (from transformers) (4.64.0)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (1.21.6)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from transformers) (4.12.0)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from transformers) (21.3)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (2022.6.2)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers) (2.23.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers) (3.7.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.7/dist-packages (from huggingface-hub<1.0,>=0.1.0->transformers) (4.1.1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->transformers) (3.0.9)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from datasets) (1.3.5)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Requirement already satisfied: pyarrow>=6.0.0 in /usr/local/lib/python3.7/dist-packages (from datasets) (6.0.1)
Requirement already satisfied: multiprocess in /usr/local/lib/python3.7/dist-packages (from datasets) (0.70.13)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Requirement already satisfied: dill<0.3.6 in /usr/local/lib/python3.7/dist-packages (from datasets) (0.3.5.1)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2022.6.15)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (1.24.3)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Collecting aiosignal>=1.1.2
  Downloading aiosignal-1.2.0-py3-none-any.whl (8.2 kB)
Collecting multidict<7.0,>=4.5
  Downloading multidict-6.0.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (94 kB)
Collecting yarl<2.0,>=1.0
  Downloading yarl-1.7.2-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (271 kB)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (144 kB)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.7/dist-packages (from aiohttp->datasets) (21.4.0)
Collecting asynctest==0.13.0
  Downloading asynctest-0.13.0-py3-none-any.whl (26 kB)
Requirement already satisfied: charset-normalizer<3.0,>=2.0 in /usr/local/lib/python3.7/dist-packages (from aiohttp->datasets) (2.1.0)
Collecting async-timeout<5.0,>=4.0.0a3
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers) (3.8.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets) (2022.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->datasets) (1.15.0)
Installing collected packages: multidict, frozenlist, yarl, urllib3, asynctest, async-timeout, aiosignal, pyyaml, fsspec, aiohttp, xxhash, tokenizers, responses, huggingface-hub, transformers, datasets
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.3:
      Successfully uninstalled urllib3-1.24.3
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Successfully installed aiohttp-3.8.1 aiosignal-1.2.0 async-timeout-4.0.2 asynctest-0.13.0 datasets-2.3.2 frozenlist-1.3.0 fsspec-2022.5.0 huggingface-hub-0.8.1 multidict-6.0.2 pyyaml-6.0 responses-0.18.0 tokenizers-0.12.1 transformers-4.20.1 urllib3-1.25.11 xxhash-3.0.0 yarl-1.7.2
As you saw in the previous lab, you can use these pipelines as they are. But sometimes, you'll need something more specific to your problem, or maybe you need it to perform better on your production data. In these cases, you'll need to fine-tune a model.
Here, youβll fine-tune a pre-trained DistilBERT model on the TyDi QA dataset.
To fine-tune your model, you will leverage three components provided by Hugging Face:
### Datasets
To get the dataset to fine-tune your model, you will use 🤗 Datasets, a lightweight and extensible library to easily share and access datasets and evaluation metrics for NLP. You can download Hugging Face datasets directly using the `load_dataset` function from the `datasets` library. Although the most common approach is to use `load_dataset`, for this lab you will use a filtered version containing only the English examples. You can read it from a public GCP bucket and use the `load_from_disk` function.
Hugging Face `datasets` allows you to load data in several formats, such as CSV, JSON, text files, and even Parquet. You can see more about the supported formats in the documentation.
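For illustration, here is a minimal sketch of how you could load a local CSV file with `load_dataset`; the file name my_data.csv is just a hypothetical placeholder and this cell is not part of the lab.
# Illustrative sketch only: loading a local CSV file with the datasets library.
# The file name "my_data.csv" is a hypothetical placeholder.
# from datasets import load_dataset
# csv_data = load_dataset("csv", data_files="my_data.csv")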
We already prepared the dataset for you, so you don't need to run the download cell below, which loads the full dataset and then filters it down to the English examples. If you want to download the dataset yourself, you can uncomment that cell and then jump ahead to the cell in which you check the type of object you get after loading the dataset.
# You can download the dataset and process it to obtain the same dataset we are loading from disk.
# Uncomment the following lines to download the dataset directly:
# from datasets import load_dataset
# train_data = load_dataset('tydiqa', 'primary_task')
# tydiqa_data = train_data.filter(lambda example: example['language'] == 'english')
If you want to use the dataset provided by us, please run the following cells. First, we will download the dataset from the GCP bucket.
# Download dataset from bucket.
!wget https://storage.googleapis.com/nlprefresh-public/tydiqa_data.zip
--2022-07-23 10:03:30-- https://storage.googleapis.com/nlprefresh-public/tydiqa_data.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.13.128, 172.217.204.128, 172.253.123.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.13.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 333821654 (318M) [application/zip]
Saving to: 'tydiqa_data.zip'
tydiqa_data.zip 100%[===================>] 318.36M 132MB/s in 2.4s
2022-07-23 10:03:33 (132 MB/s) - 'tydiqa_data.zip' saved [333821654/333821654]
# Check the size of the file. It should be around 319M.
!ls -alh tydiqa_data.zip
-rw-r--r-- 1 root root 319M Sep 9 2021 tydiqa_data.zip
Now, let's unzip the dataset.
# Unzip inside the dataset folder
!unzip tydiqa_data
Archive: tydiqa_data.zip
inflating: tydiqa_data/validation/dataset_info.json
inflating: tydiqa_data/dataset_dict.json
inflating: tydiqa_data/train/state.json
inflating: tydiqa_data/train/dataset_info.json
inflating: tydiqa_data/validation/dataset.arrow
inflating: tydiqa_data/validation/cache-32664b2bb6ecb93c.arrow
inflating: tydiqa_data/validation/cache-981c6a4602432980.arrow
inflating: tydiqa_data/validation/cache-0adce067eac1391a.arrow
inflating: tydiqa_data/validation/cache-22dd192df839003a.arrow
inflating: tydiqa_data/validation/cache-de50d25427e34427.arrow
inflating: tydiqa_data/train/cache-a7d4fcf0afedf699.arrow
inflating: tydiqa_data/train/cache-bec06ea6cf14cfc1.arrow
inflating: tydiqa_data/validation/state.json
inflating: tydiqa_data/train/dataset.arrow
inflating: tydiqa_data/train/cache-ce4e04eb371cb7de.arrow
Given that we used the Apache Arrow format to save the dataset, you have to use the `load_from_disk` function from the `datasets` library to load it. To access the preprocessed dataset we created, you should execute the following commands.
# Execute this cell if you will use the data we processed instead of downloading it.
from datasets import load_from_disk
# The path where the dataset is stored
path = '/content/tydiqa_data/'
# Load the dataset
tydiqa_data = load_from_disk(path)
tydiqa_data
DatasetDict({
train: Dataset({
features: ['passage_answer_candidates', 'question_text', 'document_title', 'language', 'annotations', 'document_plaintext', 'document_url'],
num_rows: 9211
})
validation: Dataset({
features: ['passage_answer_candidates', 'question_text', 'document_title', 'language', 'annotations', 'document_plaintext', 'document_url'],
num_rows: 1031
})
})
You can check below that the type of the loaded dataset is `datasets.arrow_dataset.Dataset`. This object is backed by an Apache Arrow table, which keeps track of where each record is stored instead of loading the complete dataset into memory. But you don't have to worry too much about that; it is just an efficient way to work with lots of data.
# Checking the object type for one of the elements in the dataset
type(tydiqa_data['train'])
datasets.arrow_dataset.Dataset
You can also check the structure of the dataset:
tydiqa_data['train']
Dataset({
features: ['passage_answer_candidates', 'question_text', 'document_title', 'language', 'annotations', 'document_plaintext', 'document_url'],
num_rows: 9211
})
You can see that each example is like a dictionary object. This dataset consists of questions, contexts, and indices that point to the start and end positions of the answer inside the context. You can access those indices using the `annotations` key, which is itself a dictionary.
idx = 600
# start index
start_index = tydiqa_data['train'][idx]['annotations']['minimal_answers_start_byte'][0]
# end index
end_index = tydiqa_data['train'][idx]['annotations']['minimal_answers_end_byte'][0]
print("Question: " + tydiqa_data['train'][idx]['question_text'])
print("\nContext (truncated): "+ tydiqa_data['train'][idx]['document_plaintext'][0:512] + '...')
print("\nAnswer: " + tydiqa_data['train'][idx]['document_plaintext'][start_index:end_index])
Question: What mental effects can a mother experience after childbirth?
Context (truncated):
Postpartum depression (PPD), also called postnatal depression, is a type of mood disorder associated with childbirth, which can affect both sexes.[1][3] Symptoms may include extreme sadness, low energy, anxiety, crying episodes, irritability, and changes in sleeping or eating patterns.[1] Onset is typically between one week and one month following childbirth.[1] PPD can also negatively affect the newborn child.[2]
While the exact cause of PPD is unclear, the cause is believed to be a combination of physi...
Answer: Postpartum depression (PPD)
The question answering model predicts a start and an end position in the context to extract as the answer. That's why this NLP task is known as extractive question answering.
To train your model, you need to pass the start and end positions as labels. So, you need to implement a function that extracts the start and end positions from the dataset.
The dataset contains unanswerable questions. For these, the start and end indices for the answer are equal to -1.
tydiqa_data['train'][0]['annotations']
{'minimal_answers_end_byte': [-1],
'minimal_answers_start_byte': [-1],
'passage_answer_candidate_index': [-1],
'yes_no_answer': ['NONE']}
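For example, a quick and purely illustrative way to count how many training questions fall into this unanswerable category is to scan the annotations column, as sketched in the commented cell below (not part of the original lab):
# Illustrative sketch only: count the training examples whose minimal answer start index is -1 (unanswerable).
# sum(1 for ann in tydiqa_data['train']['annotations'] if ann['minimal_answers_start_byte'][0] == -1)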
Now, you have to flatten the dataset so you can work with an object that has a table structure instead of a nested dictionary structure. This makes the pre-processing steps easier.
# Flattening the datasets
flattened_train_data = tydiqa_data['train'].flatten()
flattened_test_data = tydiqa_data['validation'].flatten()
Also, to make the training more straightforward and faster, we will extract a subset of the train and test datasets. For that purpose, we will use the Hugging Face Dataset object's `select()` method, which lets you take specific data points by their index. Here, you will select the first 3000 training rows (and 1000 test rows); you can play with these numbers, but consider that more data points will increase the training time.
# Selecting a subset of the train dataset
flattened_train_data = flattened_train_data.select(range(3000))
# Selecting a subset of the test dataset
flattened_test_data = flattened_test_data.select(range(1000))
### Tokenizers
Now, you will use a tokenizer object from Hugging Face. You can load a tokenizer in several ways; here, you will load the tokenizer associated with the checkpoint used by the question answering pipeline from the previous Hugging Face lab. With this tokenizer, you can ensure that the tokens you get for the dataset will match the tokens used in the original DistilBERT implementation.
When loading a tokenizer with any method, you must pass the model checkpoint that you want to fine-tune. Here, you are using the 'distilbert-base-cased-distilled-squad' checkpoint.
# Import the AutoTokenizer from the transformers library
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
Downloading: 0%| | 0.00/29.0 [00:00<?, ?B/s]
Downloading: 0%| | 0.00/473 [00:00<?, ?B/s]
Downloading: 0%| | 0.00/208k [00:00<?, ?B/s]
Downloading: 0%| | 0.00/426k [00:00<?, ?B/s]
Given the characteristics of the dataset and the question-answering task, you will need to add some pre-processing steps after the tokenization:
1. When there is no answer to a question given a context, you will use the `CLS` token, a unique token used to represent the start of the sequence.
2. Tokenizers can split a given string into substrings, resulting in a subtoken for each substring and creating a misalignment between the character indices in the dataset and the token positions the model expects. Therefore, you will need to align the start and end indices with the tokens associated with the target answer words.
3. Finally, a tokenizer can truncate a very long sequence. So, if the start/end position of an answer is `None`, you will assume that it was truncated and assign the maximum length of the tokenizer to those positions.
Those three steps are done within the `process_samples` function defined below.
# Processing samples using the 3 steps described above.
def process_samples(sample):
    tokenized_data = tokenizer(sample['document_plaintext'], sample['question_text'], truncation="only_first", padding="max_length")

    input_ids = tokenized_data["input_ids"]

    # We will label impossible answers with the index of the CLS token.
    cls_index = input_ids.index(tokenizer.cls_token_id)

    # If no answers are given, set the cls_index as answer.
    if sample["annotations.minimal_answers_start_byte"][0] == -1:
        start_position = cls_index
        end_position = cls_index
    else:
        # Start/end character index of the answer in the text.
        gold_text = sample["document_plaintext"][sample['annotations.minimal_answers_start_byte'][0]:sample['annotations.minimal_answers_end_byte'][0]]
        start_char = sample["annotations.minimal_answers_start_byte"][0]
        end_char = sample['annotations.minimal_answers_end_byte'][0]  # start_char + len(gold_text)

        # Sometimes answers are off by a character or two; fix this.
        if sample['document_plaintext'][start_char-1:end_char-1] == gold_text:
            start_char = start_char - 1
            end_char = end_char - 1    # When the gold label is off by one character
        elif sample['document_plaintext'][start_char-2:end_char-2] == gold_text:
            start_char = start_char - 2
            end_char = end_char - 2    # When the gold label is off by two characters

        start_token = tokenized_data.char_to_token(start_char)
        end_token = tokenized_data.char_to_token(end_char - 1)

        # If the start/end position is None, the answer passage has been truncated.
        if start_token is None:
            start_token = tokenizer.model_max_length
        if end_token is None:
            end_token = tokenizer.model_max_length

        start_position = start_token
        end_position = end_token

    return {'input_ids': tokenized_data['input_ids'],
            'attention_mask': tokenized_data['attention_mask'],
            'start_positions': start_position,
            'end_positions': end_position}
To apply the `process_samples` function defined above to the whole dataset, you can use the `map` method as follows:
# Tokenizing and processing the flattened dataset
processed_train_data = flattened_train_data.map(process_samples)
processed_test_data = flattened_test_data.map(process_samples)
Parameter 'function'=<function process_samples at 0x7f7c46cd45f0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
0%| | 0/3000 [00:00<?, ?ex/s]
0%| | 0/1000 [00:00<?, ?ex/s]
### Transformers
The last Hugging Face component useful for fine-tuning a transformer is the pre-trained model itself, which you can access in multiple ways. For this lab, you will use the same model from the question-answering pipeline that you loaded before.
# Import the AutoModelForQuestionAnswering for the pre-trained model. We will only fine-tune the head of the model.
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")
Downloading: 0%| | 0.00/249M [00:00<?, ?B/s]
Now, you can take the necessary columns from the datasets for training/testing and return them as PyTorch tensors.
columns_to_return = ['input_ids','attention_mask', 'start_positions', 'end_positions']
processed_train_data.set_format(type='pt', columns=columns_to_return)
processed_test_data.set_format(type='pt', columns=columns_to_return)
Here, we give you the F1 score as a metric to evaluate your model's performance. We will use this metric for simplicity, although it is based only on the start and end positions predicted by the model. If you want to dig deeper into other metrics that can be used for a question answering task, you can also check this Colab notebook resource from the Hugging Face team.
from sklearn.metrics import f1_score

def compute_f1_metrics(pred):
    start_labels = pred.label_ids[0]
    start_preds = pred.predictions[0].argmax(-1)
    end_labels = pred.label_ids[1]
    end_preds = pred.predictions[1].argmax(-1)

    f1_start = f1_score(start_labels, start_preds, average='macro')
    f1_end = f1_score(end_labels, end_preds, average='macro')

    return {
        'f1_start': f1_start,
        'f1_end': f1_end,
    }
Now, you will use the Hugging Face Trainer to fine-tune your model.
# Training the model may take around 15 minutes.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='model_results5',        # output directory
    overwrite_output_dir=True,
    num_train_epochs=3,                 # total number of training epochs
    per_device_train_batch_size=8,      # batch size per device during training
    per_device_eval_batch_size=8,       # batch size for evaluation
    warmup_steps=20,                    # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                  # strength of weight decay
    logging_dir=None,                   # directory for storing logs
    logging_steps=50
)

trainer = Trainer(
    model=model,                            # the instantiated 🤗 Transformers model to be trained
    args=training_args,                     # training arguments, defined above
    train_dataset=processed_train_data,     # training dataset
    eval_dataset=processed_test_data,       # evaluation dataset
    compute_metrics=compute_f1_metrics
)
trainer.train()
The following columns in the training set don't have a corresponding argument in `DistilBertForQuestionAnswering.forward` and have been ignored: language, annotations.yes_no_answer, annotations.minimal_answers_start_byte, question_text, annotations.minimal_answers_end_byte, document_title, document_plaintext, passage_answer_candidates.plaintext_end_byte, annotations.passage_answer_candidate_index, passage_answer_candidates.plaintext_start_byte, document_url. If language, annotations.yes_no_answer, annotations.minimal_answers_start_byte, question_text, annotations.minimal_answers_end_byte, document_title, document_plaintext, passage_answer_candidates.plaintext_end_byte, annotations.passage_answer_candidate_index, passage_answer_candidates.plaintext_start_byte, document_url are not expected by `DistilBertForQuestionAnswering.forward`, you can safely ignore this message.
/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
FutureWarning,
***** Running training *****
Num examples = 3000
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 1125
[1125/1125 07:26, Epoch 3/3]
Saving model checkpoint to model_results5/checkpoint-500
Configuration saved in model_results5/checkpoint-500/config.json
Model weights saved in model_results5/checkpoint-500/pytorch_model.bin
Saving model checkpoint to model_results5/checkpoint-1000
Configuration saved in model_results5/checkpoint-1000/config.json
Model weights saved in model_results5/checkpoint-1000/pytorch_model.bin
Training completed. Do not forget to share your model on huggingface.co/models =)
TrainOutput(global_step=1125, training_loss=1.2449743559095594, metrics={'train_runtime': 449.7034, 'train_samples_per_second': 20.013, 'train_steps_per_second': 2.502, 'total_flos': 1175877900288000.0, 'train_loss': 1.2449743559095594, 'epoch': 3.0})
And, in the next cell, you can evaluate the fine-tuned model's performance on the test set.
# The evaluation may take around 30 seconds
trainer.evaluate(processed_test_data)
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForQuestionAnswering.forward` and have been ignored: language, annotations.yes_no_answer, annotations.minimal_answers_start_byte, question_text, annotations.minimal_answers_end_byte, document_title, document_plaintext, passage_answer_candidates.plaintext_end_byte, annotations.passage_answer_candidate_index, passage_answer_candidates.plaintext_start_byte, document_url. If language, annotations.yes_no_answer, annotations.minimal_answers_start_byte, question_text, annotations.minimal_answers_end_byte, document_title, document_plaintext, passage_answer_candidates.plaintext_end_byte, annotations.passage_answer_candidate_index, passage_answer_candidates.plaintext_start_byte, document_url are not expected by `DistilBertForQuestionAnswering.forward`, you can safely ignore this message.
***** Running Evaluation *****
Num examples = 1000
Batch size = 8
[125/125 00:16]
{'epoch': 3.0,
'eval_f1_end': 0.10903973263672619,
'eval_f1_start': 0.09401088809221052,
'eval_loss': 2.3242716789245605,
'eval_runtime': 16.8474,
'eval_samples_per_second': 59.356,
'eval_steps_per_second': 7.42}
### Using your Fine-Tuned Model
After training and evaluating your fine-tuned model, you can check its results for the same questions from the previous lab.
For that, you will tell PyTorch to run the model on your GPU or your CPU. Additionally, you will need to tokenize your input context and questions. Finally, you need to post-process the output results to transform them from tokens into human-readable strings using the `tokenizer`.
import torch
text = r"""
The Golden Age of Comic Books describes an era of American comic books from the
late 1930s to circa 1950. During this time, modern comic books were first published
and rapidly increased in popularity. The superhero archetype was created and many
well-known characters were introduced, including Superman, Batman, Captain Marvel
(later known as SHAZAM!), Captain America, and Wonder Woman.
Between 1939 and 1941 Detective Comics and its sister company, All-American Publications,
introduced popular superheroes such as Batman and Robin, Wonder Woman, the Flash,
Green Lantern, Doctor Fate, the Atom, Hawkman, Green Arrow and Aquaman.[7] Timely Comics,
the 1940s predecessor of Marvel Comics, had million-selling titles featuring the Human Torch,
the Sub-Mariner, and Captain America.[8]
As comic books grew in popularity, publishers began launching titles that expanded
into a variety of genres. Dell Comics' non-superhero characters (particularly the
licensed Walt Disney animated-character comics) outsold the superhero comics of the day.[12]
The publisher featured licensed movie and literary characters such as Mickey Mouse, Donald Duck,
Roy Rogers and Tarzan.[13] It was during this era that noted Donald Duck writer-artist
Carl Barks rose to prominence.[14] Additionally, MLJ's introduction of Archie Andrews
in Pep Comics #22 (December 1941) gave rise to teen humor comics,[15] with the Archie
Andrews character remaining in print well into the 21st century.[16]
At the same time in Canada, American comic books were prohibited importation under
the War Exchange Conservation Act[17] which restricted the importation of non-essential
goods. As a result, a domestic publishing industry flourished during the duration
of the war which were collectively informally called the Canadian Whites.
The educational comic book Dagwood Splits the Atom used characters from the comic
strip Blondie.[18] According to historian Michael A. Amundson, appealing comic-book
characters helped ease young readers' fear of nuclear war and neutralize anxiety
about the questions posed by atomic power.[19] It was during this period that long-running
humor comics debuted, including EC's Mad and Carl Barks' Uncle Scrooge in Dell's Four
Color Comics (both in 1952).[20][21]
"""
questions = ["What superheroes were introduced between 1939 and 1941 by Detective Comics and its sister company?",
"What comic book characters were created between 1939 and 1941?",
"What well-known characters were created between 1939 and 1941?",
"What well-known superheroes were introduced between 1939 and 1941 by Detective Comics?"]
for question in questions:
    inputs = tokenizer.encode_plus(question, text, return_tensors="pt")
    #print("inputs", inputs)
    #print("inputs", type(inputs))
    input_ids = inputs["input_ids"].tolist()[0]
    inputs.to(model.device)  # Move the inputs to the same device (GPU or CPU) as the model
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    answer_model = model(**inputs)

    # Get the most likely beginning of the answer with the argmax of the start scores
    answer_start = torch.argmax(answer_model['start_logits'])
    # Get the most likely end of the answer with the argmax of the end scores
    answer_end = torch.argmax(answer_model['end_logits']) + 1

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
Question: What superheroes were introduced between 1939 and 1941 by Detective Comics and its sister company?
Answer: Superman, Batman, Captain Marvel ( later known as SHAZAM! ), Captain America, and Wonder Woman. Between 1939 and 1941 Detective Comics and its sister company, All - American Publications, introduced popular superheroes such as Batman and Robin, Wonder Woman, the Flash, Green Lantern, Doctor Fate, the Atom, Hawkman, Green Arrow and Aquaman
Question: What comic book characters were created between 1939 and 1941?
Answer: Superman, Batman, Captain Marvel ( later known as SHAZAM! ), Captain America, and Wonder Woman
Question: What well-known characters were created between 1939 and 1941?
Answer: Superman, Batman, Captain Marvel ( later known as SHAZAM! ), Captain America, and Wonder Woman
Question: What well-known superheroes were introduced between 1939 and 1941 by Detective Comics?
Answer: Superman, Batman, Captain Marvel ( later known as SHAZAM! ), Captain America, and Wonder Woman
You can compare those results with those obtained using the pipeline, as you did in the previous lab. As a reminder, here are those results:
What popular superheroes were introduced between 1939 and 1941?
>> teen humor comics
What superheroes were introduced between 1939 and 1941 by Detective Comics and its sister company?
>> Archie Andrews
What comic book characters were created between 1939 and 1941?
>> Archie
Andrews
What well-known characters were created between 1939 and 1941?
>> Archie
Andrews
What well-known superheroes were introduced between 1939 and 1941 by Detective Comics?
>> Archie Andrews
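If you want to reproduce those baseline answers yourself, a minimal sketch is given in the commented cell below. It simply re-runs the same questions through a plain question-answering pipeline, assuming the same 'distilbert-base-cased-distilled-squad' checkpoint used in the previous lab; it is illustrative and not part of the original lab.
# Illustrative sketch only: re-run the questions through a plain (not fine-tuned)
# question-answering pipeline to reproduce the baseline answers listed above.
# from transformers import pipeline
# qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
# for question in questions:
#     result = qa_pipeline(question=question, context=text)
#     print(f"{question}\n>> {result['answer']}\n")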
Congratulations!
You have finished this series of ungraded labs. You were able to:
- Explore the Hugging Face pipelines, which can be used right out of the box.
- Fine-tune a model for the extractive question answering task.
I recommend you go through the free Hugging Face course to explore their ecosystem in more detail and find different ways to use the `transformers` library.