Finetuning with instruction prompts:
Prepare the instruction dataset (prompt-completion pairs built with a prompt template), then split it into training, validation, and test sets (see the sketch below).
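A minimal sketch of this step, assuming the Hugging Face `datasets` library; the prompt template and the toy rows are made up for illustration:

```python
from datasets import Dataset

# Hypothetical instruction template for a summarization task.
PROMPT_TEMPLATE = "Summarize the following conversation.\n\n{dialogue}\n\nSummary: "

# Toy rows standing in for a real corpus (e.g. one loaded with load_dataset).
raw = [
    {"dialogue": "A: Lunch at noon? B: Sure.", "summary": "They agree to lunch at noon."},
    {"dialogue": "A: Meeting moved to 3pm. B: Got it.", "summary": "The meeting moved to 3pm."},
    {"dialogue": "A: Did you ship it? B: Yes, today.", "summary": "The package shipped today."},
    {"dialogue": "A: Exam is Friday. B: Thanks!", "summary": "The exam is on Friday."},
]

# Wrap each raw example into a prompt-completion pair.
dataset = Dataset.from_list([
    {"prompt": PROMPT_TEMPLATE.format(dialogue=ex["dialogue"]),
     "completion": ex["summary"]}
    for ex in raw
])

# Split into train / validation / test.
split = dataset.train_test_split(test_size=0.5, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))  # 2 1 1
```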
How to avoid catastrophic forgetting (degrading other tasks while finetuning on one)? Use a dataset that contains examples from a variety of tasks (multi-task finetuning).
FLAN (Fine-tuned LAnguage Net)
$$ \text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}} $$
Using other metrics:
$$ \text{ROUGE-1 Recall} = \frac{\text{unigram matches}}{\text{unigrams in reference}} $$
$$ \text{ROUGE-1 Precision} = \frac{\text{unigram matches}}{\text{unigrams in output}} $$
$$ \text{ROUGE-1 F1} = 2\times \frac{\text{precision}\times\text{recall}}{\text{precision + recall}} $$
ROUGE-2 uses bigrams, ROUGE-3 uses trigrams, and so on.
However, if the reference is
It is cold outside
and the generated output is
cold cold cold cold
then the ROUGE-1 precision will be
$$ \text{ROUGE-1 Precision} = \frac{\text{unigram matches}}{\text{unigrams in output}} = \frac{4}{4} = 1.0 $$
We don’t want that to happen. A clipping function gives the modified precision:
$$ \text{ROUGE-1 Modified Precision} = \frac{\text{clip(unigram matches)}}{\text{unigrams in output}} = \frac{1}{4} = 0.25 $$
If instead the generated output is
outside cold it is
the modified precision will still be 1.0, even though the word order is wrong. Hence, in this situation we should not rely on unigrams but use bigrams or trigrams instead (see the sketch below).
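A minimal plain-Python sketch of ROUGE-N with the clipping behaviour described above; tokenization here is naive whitespace splitting, unlike full implementations such as the `rouge_score` package:

```python
from collections import Counter

def rouge_n(reference: str, output: str, n: int = 1, clip: bool = True):
    """ROUGE-N precision, recall, and F1, with optional clipped (modified) precision."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, out = ngrams(reference), ngrams(output)
    if clip:
        # Each output n-gram counts at most as often as it appears in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in out.items())
    else:
        matches = sum(count for gram, count in out.items() if gram in ref)
    precision = matches / max(sum(out.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# "cold cold cold cold": unclipped precision 4/4 = 1.0; clipped 1/4 = 0.25.
print(rouge_n("It is cold outside", "cold cold cold cold", clip=False))
print(rouge_n("It is cold outside", "cold cold cold cold", clip=True))
# "outside cold it is": clipped unigram precision is still 4/4 = 1.0,
# but at the bigram level only "it is" matches, so ROUGE-2 precision is 1/3.
print(rouge_n("It is cold outside", "outside cold it is", n=2))
```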
Parameter-Efficient Fine-Tuning (PEFT): finetune only a subset of model parameters, which also helps prevent catastrophic forgetting.
Methods: selective, reparameterization, and additive.
LoRA (a reparameterization method):
Steps to update the model for inference: matrix-multiply B and A, then add the product to the original frozen weights (W' = W + BA).
Example using base transformer as reference (86% reduction in parameters to train):
Model | Weight dim | Weights in total |
---|---|---|
Base transformer | 512 x 64 | 32768 |
LoRA r=8 | A (8 x 64), B (512 x 8) | 512 + 4096 = 4608 |
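A minimal PyTorch sketch of this r=8 decomposition, using the conventional zero-initialized B and randomly initialized A; all names are illustrative:

```python
import torch

d_model, d_ff, r = 512, 64, 8  # dims from the table above

# Frozen pretrained weight: 512 x 64 = 32768 parameters.
W = torch.randn(d_model, d_ff, requires_grad=False)

# Trainable low-rank pair: B (512 x 8) and A (8 x 64) = 4096 + 512 = 4608 parameters.
B = torch.zeros(d_model, r, requires_grad=True)  # zero init, so BA starts at 0
A = torch.randn(r, d_ff, requires_grad=True)     # random init

# During training only A and B receive gradients; W stays frozen.
# For inference, merge the low-rank update into the base weight:
W_eff = W + B @ A  # same 512 x 64 shape, so no extra inference latency

trainable = A.numel() + B.numel()
print(f"trainable: {trainable}, full: {W.numel()}, "
      f"reduction: {1 - trainable / W.numel():.0%}")  # ~86%
```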
LoRA for generative LLMs: train a different A and B pair for each task, then swap them in at inference time to switch tasks.
Prompt tuning is not prompt engineering.
It adds trainable soft-prompt vectors to the embedding layer while the base model stays frozen.
Switch the soft prompt to change the finetuning task (see the sketch below).
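A minimal PyTorch sketch of a soft prompt, assuming an embedding dimension of 512 and 20 virtual tokens (both illustrative):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable soft-prompt vectors prepended to the input embeddings.

    The frozen LLM is untouched; only `prompt_embeddings` is trained,
    and a different set can be swapped in per task.
    """
    def __init__(self, num_virtual_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt_embeddings = nn.Parameter(
            torch.randn(num_virtual_tokens, embed_dim) * 0.02
        )

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt_embeddings.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Example: 20 virtual tokens prepended to a batch of embedded inputs.
soft_prompt = SoftPrompt(num_virtual_tokens=20, embed_dim=512)
x = torch.randn(2, 10, 512)   # embedded input batch
print(soft_prompt(x).shape)   # torch.Size([2, 30, 512])
```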
Question | Answer |
---|---|
1. Fill in the blanks: __________ involves using many prompt-completion examples as the labeled training dataset to continue training the model by updating its weights. This is different from _________ where you provide prompt-completion examples during inference. | Instruction fine-tuning, in-context learning |
2. Fine-tuning a model on a single task can improve model performance specifically on that task; however, it can also degrade the performance of other tasks as a side effect. This phenomenon is known as: | Catastrophic forgetting |
3. Which evaluation metric below focuses on precision in matching generated output to the reference text and is used for text translation? | BLEU |
4. Which of the following statements about multi-task finetuning are correct? Select all that apply. | It helps prevent catastrophic forgetting & FLAN-T5 was trained with multi-task finetuning |
5. “Smaller LLMs can struggle with one-shot and few-shot inference.” | True |
6. Which of the following are Parameter Efficient Fine-Tuning (PEFT) methods? Select all that apply. | Reparameterization, Selective & Additive |
7. Which of the following best describes how LoRA works? | Decompose weights to smaller matrices and train those |
8. What is a soft prompt in the context of LLMs (Large Language Models)? | A set of trainable tokens |
9. “Prompt Tuning is a technique used to adjust all hyperparameters of a language model.” | False |
10. “PEFT methods can reduce the memory needed for fine-tuning dramatically, sometimes to just 12-20% of the memory needed for full fine-tuning.” | True |