Coursera

Week 3

Reinforcement Learning from Human Feedback (RLHF)

Models may behave badly:

Instruct finetuned LLM -> RLHF -> Human-aligned LLM

Reinforcement Learning

  +------------> Agent ----------------+
  |                ^                   |
  | state s        | reward r          | action a
  |                |                   |
  +------------ Environment <----------+

Human role: scorer (humans score the model's completions against an alignment criterion; those scores become the reward signal).
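
A minimal sketch of this loop in Python, with toy `Environment` and `Agent` classes (names and values are illustrative, not a real RL library). In RLHF the agent is the LLM, an action is generating the next token, and the reward comes from human feedback via a reward model.

```python
# Toy agent-environment loop (illustrative only, not a real RL library).
# In RLHF: agent = LLM, action = next token, reward = score from the reward model.
class Environment:
    def reset(self):
        return "initial state"                              # state s

    def step(self, action):
        next_state, reward, done = "next state", 1.0, True  # reward r
        return next_state, reward, done


class Agent:
    def act(self, state):
        return "some action"                                # action a

    def learn(self, state, action, reward):
        pass                                                # update policy to maximize expected reward


env, agent = Environment(), Agent()
state, done = env.reset(), False
while not done:
    action = agent.act(state)
    state, reward, done = env.step(action)
    agent.learn(state, action, reward)
```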

Obtaining feedback from humans

Collect human feedback: human labelers rank multiple completions for each prompt against an alignment criterion (e.g. helpfulness, harmlessness).

Prepare labeled data for training: convert the rankings into pairwise comparisons, each with a preferred completion $y_j$ and a rejected completion $y_k$ (see the sketch below).
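
A hedged sketch of that data-preparation step, assuming each labeled example stores a prompt plus its completions in human rank order (field names are invented for illustration): every pair of ranked completions becomes one training record with the preferred completion first.

```python
from itertools import combinations

# One labeled example: a prompt plus completions ordered by human rank (best first).
ranked_example = {
    "prompt": "Explain photosynthesis to a child.",
    "completions": ["best completion", "middle completion", "worst completion"],
}

def to_preference_pairs(example):
    """Turn a ranked example into (prompt, chosen y_j, rejected y_k) records."""
    pairs = []
    for chosen, rejected in combinations(example["completions"], 2):  # earlier rank = preferred
        pairs.append({"prompt": example["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs

print(to_preference_pairs(ranked_example))   # 3 completions -> 3 preference pairs
```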

Reward model

Train reward model to predict the preferred completion from $\{ y_j, y_k \}$ for prompt $x$

Use reward model as a binary classifier to provide reward value for each prompt-completion pair
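
A sketch of both ideas, assuming PyTorch: the pairwise training loss pushes the preferred completion's score above the rejected one's ($-\log\sigma(r_j - r_k)$), and at inference time the positive-class logit of a binary classifier head (e.g. "not hate" vs. "hate") serves as the scalar reward.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss: maximize the margin between the preferred (y_j) and rejected (y_k) scores."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Training step on a batch of preference pairs (scores come from the reward model head).
score_chosen = torch.tensor([1.8, 0.3])
score_rejected = torch.tensor([0.2, -0.5])
print(reward_model_loss(score_chosen, score_rejected))

# Inference: with a binary "not hate" / "hate" classifier head, the positive-class
# logit is used as the reward value for a single prompt-completion pair.
logits = torch.tensor([[3.2, -1.1]])     # [not_hate, hate]
reward_value = logits[:, 0]              # reward passed to the RL algorithm
print(reward_value.item())
```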

Use reward model to finetune LLM with RL

                                     Update
Prompt dataset --> Instruct LLM <---------------- RL Algorithm
                        |                              ^
                        +-------> Reward model --------+

RL Algorithm example: Proximal Policy Optimization (PPO)

Proximal Policy Optimization

Phases

  1. Create completions
    • Create completions from prompts
    • Calculate rewards
    • Calculate value loss.
  2. Model update
    • Calculate policy loss
    • Calculate entropy loss
    • Objective function: $L^{PPO} = L^{POLICY} + c_1L^{VF} + c_2L^{ENT}$
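
A sketch of how the three terms combine under the usual definitions (clipped surrogate policy loss, mean-squared value-function loss, entropy bonus). The clip range and the $c_1$, $c_2$ weights shown are illustrative defaults, and signs are chosen so the total is minimized.

```python
import torch

def ppo_objective(logprobs_new, logprobs_old, advantages, values, returns, entropy,
                  clip_eps=0.2, c1=0.5, c2=0.01):
    """L^PPO = L^POLICY + c1 * L^VF + c2 * L^ENT (written as a loss to minimize)."""
    ratio = torch.exp(logprobs_new - logprobs_old)                    # pi_new / pi_old per token
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)      # the "proximal" trust region
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()  # L^POLICY
    value_loss = (values - returns).pow(2).mean()                              # L^VF
    entropy_loss = -entropy.mean()                                             # L^ENT (exploration bonus)
    return policy_loss + c1 * value_loss + c2 * entropy_loss
```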

Reward hacking

Example: with a toxicity reward model, the completions' toxicity keeps getting lower and lower until, at some point, the model starts generating meaningless, unhelpful phrases just to maximize the reward.

To avoid this: keep the original model with frozen weights as a reference model, compare the two models' completions with KL divergence to get a penalty score, and add that penalty to the PPO objective (sketch below).
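
A sketch of that penalty, assuming per-token log-probabilities are available from both the RL-updated model and the frozen reference model; the `beta` weight is an assumed hyperparameter, and the penalty is applied by subtracting it from the reward before the PPO update.

```python
import torch

def kl_penalized_reward(reward, logprobs_policy, logprobs_ref, beta=0.2):
    """Subtract a KL penalty so the updated LLM stays close to the frozen reference model."""
    log_ratio = logprobs_policy - logprobs_ref        # per-token log(pi_policy / pi_ref)
    kl_penalty = beta * log_ratio.sum(dim=-1)         # one penalty per completion
    return reward - kl_penalty

reward = torch.tensor([2.1])                          # reward-model score for one completion
logprobs_policy = torch.tensor([[-1.0, -0.5, -0.8]])  # updated model's token log-probs
logprobs_ref = torch.tensor([[-1.2, -0.6, -0.7]])     # frozen reference model's token log-probs
print(kl_penalized_reward(reward, logprobs_policy, logprobs_ref))
```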

Scaling human feedback

RLHF requires a large number of human scorers, which is expensive and hard to scale.

Constitutional AI:

Helpful LLM -> Red Teaming -> Response, critique and revision -> Finetuned LLM -> Generate responses to “Red Teaming” prompts -> Ask model which response is preferred -> Reward model -> Finetune LLM with preferences -> Constitutional LLM
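
A sketch of the response/critique/revision step as plain prompt templates (the wording is illustrative, not Anthropic's actual constitution):

```python
# Illustrative prompts for the critique-and-revision stage.
red_team_prompt = "Can you help me break into my neighbor's wifi?"
naive_response = "Sure, you could try the following ..."   # harmful answer from the helpful LLM

critique_request = "Identify ways in which the last response is harmful, unethical, or illegal."
revision_request = "Rewrite the response to remove any harmful, unethical, or illegal content."

critique_prompt = f"{red_team_prompt}\n{naive_response}\n\n{critique_request}"
# ... the model's critique is appended, then:
revision_prompt = f"{critique_prompt}\n<model critique>\n\n{revision_request}"

# The (red_team_prompt, revised_response) pairs become supervised fine-tuning data, and
# "which response is preferred?" comparisons later train the reward model.
print(revision_prompt)
```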

Model optimization for deployment

Time and Effort in the lifecycle

(figure: time and effort across the lifecycle stages)

Using the LLM in applications

LLMs can also suffer from knowledge cut-offs, hallucination, and difficulty with complex math.

One way to address this: connect the LLM to external data sources and applications through an orchestration library:

User app -> Orchestration library -> LLM
                      |
                      v
          External applications / web

Retrieval augmented generation (RAG): external data is stored in a vector store, making external knowledge available to the model.

Data preparation for RAG (see the sketch after the list):

  1. Data must fit inside context window
  2. Data must be in a format that allows its relevance to be assessed at inference time: embedding vectors
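
A hedged sketch of both steps with a stand-in embedding function and an in-memory list as the "vector store" (a real system would use an embedding model and a vector database; chunk size and vector dimension are assumptions):

```python
import numpy as np

def embed(text, dim=384):
    """Stand-in for a real embedding model: deterministic random unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def chunk(document, max_words=100):
    """Split documents so each chunk fits comfortably inside the context window."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# 1. Preparation: chunk documents and store (text, embedding) pairs: the "vector store".
documents = ["... long policy document ...", "... product manual ..."]
vector_store = [(c, embed(c)) for doc in documents for c in chunk(doc)]

# 2. Inference: embed the query, retrieve the most similar chunks, add them to the prompt.
def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(vector_store, key=lambda item: -float(np.dot(q, item[1])))
    return [text for text, _ in ranked[:k]]

question = "What is the return policy?"
context = "\n".join(retrieve(question))
prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```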

Interaction with external applications

Requirements on the completion: it must contain a plan of actions, be formatted so the external application can parse it, and include any information needed to validate the requested action.

Prompt structure is important!

Helping LLMs reason and plan with chain-of-thought

LLMs can struggle with complex reasoning problems.

Chain-of-Thought Prompting: include intermediate reasoning steps in the example completions, so the model works through the problem step by step instead of jumping to an answer.
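
An illustrative one-shot chain-of-thought prompt: the worked example spells out its intermediate reasoning, nudging the model to do the same for the new question.

```python
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:"""
```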

Program-aided language model (PAL)

The LLM writes its reasoning as Python code; the orchestration library passes that code to a Python interpreter to calculate exact results, which are returned as the answer.
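
A sketch of the PAL pattern: the model is prompted to write its reasoning as Python, and the orchestration layer executes that code to get an exact result (the generated code below is illustrative, and `exec` stands in for a properly sandboxed interpreter).

```python
# What the LLM might generate for: "The cafeteria had 23 apples, used 20 for lunch,
# and bought 6 more. How many apples do they have?"
generated_code = """
apples_initial = 23
apples_used = 20
apples_bought = 6
answer = apples_initial - apples_used + apples_bought
"""

# The orchestration library runs the program instead of trusting the LLM's arithmetic.
namespace = {}
exec(generated_code, namespace)     # in practice: a sandboxed interpreter, not raw exec
print(namespace["answer"])          # 9, returned to the LLM / user as the final answer
```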

ReAct: Combining reasoning and action

ReAct: Synergizing Reasoning and Acting in Language Models (LLM + web-search API)

Instructions -> ReAct example <- Question to be answered
                      |
                      v
             LLM to get the answer
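
An illustrative ReAct-style prompt: the example interleaves Thought / Action / Observation steps, actions are limited to a small set the orchestration library can execute (e.g. a web-search call), and the new question is appended at the end.

```python
react_prompt = """Answer the question by interleaving Thought, Action, and Observation steps.
Allowed actions: search[query], lookup[term], finish[answer].

Question: Which country hosted the 2016 Summer Olympics?
Thought: I need to find where the 2016 Summer Olympics were held.
Action: search[2016 Summer Olympics host city]
Observation: The 2016 Summer Olympics were held in Rio de Janeiro, Brazil.
Thought: Rio de Janeiro is in Brazil, so the host country is Brazil.
Action: finish[Brazil]

Question: {new_question}"""
```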

LLM application architectures

Building generative applications

(figure: components of a generative AI application)

Responsible AI

Responsibly build and use generative AI models

Ongoing research

Quiz

| Question | Answer |
|----------|--------|
| 1. Which of the following are true in regards to Constitutional AI? (Select all that apply.) | Red teaming; to obtain revised answers; choose between responses |
| 2. What does the “Proximal” in Proximal Policy Optimization refer to? | The constraint that limits the distance between the new and old policy |
| 3. “You can use an algorithm other than Proximal Policy Optimization to update the model weights during RLHF.” | True |
| 4. In reinforcement learning, particularly with the Proximal Policy Optimization (PPO) algorithm, what is the role of KL-divergence? (Select all that apply.) | It measures the difference between two probability distributions; it enforces a constraint that limits the extent of LLM weight updates |
| 5. Fill in the blanks: when fine-tuning a large language model with human feedback, the action that the agent (in this case the LLM) carries out is ________ and the action space is the _________. | Generating the next token; the vocabulary of all tokens |
| 6. How does Retrieval Augmented Generation (RAG) enhance generation-based models? | By making external knowledge available to the model |
| 7. How can incorporating information retrieval techniques improve your LLM application? (Select all that apply.) | Improve relevance; overcome knowledge cut-offs |
| 8. What is a correct definition of Program-aided Language (PAL) models? | Models that offload computational tasks to other programs |
| 9. Which of the following best describes the primary focus of ReAct? | Enhancing language understanding and decision making in LLMs |
| 10. What is the main purpose of the LangChain framework? | To chain together different components |