Coursera

Week 3

Reinforcement Learning from Human Feedback (RLHF)

Models may behave badly:

Instruct finetuned LLM -> RLHF -> Human-aligned LLM

Reinforcement Learning

  +------------> Agent ----------------+
  |                ^                   |
  | state s        | reward r          | action a
  |                |                   |
  +------------ Environment <----------+

Human role: scorer (humans score the model's completions against an alignment criterion; those scores become the reward signal).
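
A minimal sketch of this loop in Python, with toy `Environment` and `Agent` classes (names and values are illustrative, not a real RL library). In RLHF the agent is the LLM, an action is generating the next token, and the reward comes from human feedback via a reward model.

```python
# Toy agent-environment loop (illustrative only, not a real RL library).
# In RLHF: agent = LLM, action = next token, reward = score from the reward model.
class Environment:
    def reset(self):
        return "initial state"                              # state s

    def step(self, action):
        next_state, reward, done = "next state", 1.0, True  # reward r
        return next_state, reward, done


class Agent:
    def act(self, state):
        return "some action"                                # action a

    def learn(self, state, action, reward):
        pass                                                # update policy to maximize expected reward


env, agent = Environment(), Agent()
state, done = env.reset(), False
while not done:
    action = agent.act(state)
    state, reward, done = env.step(action)
    agent.learn(state, action, reward)
```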

Obtaining feedback from humans

Collect human feedback: human labelers rank multiple completions for each prompt against an alignment criterion (e.g. helpfulness, harmlessness).

Prepare labeled data for training: convert the rankings into pairwise comparisons, each with a preferred completion $y_j$ and a rejected completion $y_k$ (see the sketch below).
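
A hedged sketch of that data-preparation step, assuming each labeled example stores a prompt plus its completions in human rank order (field names are invented for illustration): every pair of ranked completions becomes one training record with the preferred completion first.

```python
from itertools import combinations

# One labeled example: a prompt plus completions ordered by human rank (best first).
ranked_example = {
    "prompt": "Explain photosynthesis to a child.",
    "completions": ["best completion", "middle completion", "worst completion"],
}

def to_preference_pairs(example):
    """Turn a ranked example into (prompt, chosen y_j, rejected y_k) records."""
    pairs = []
    for chosen, rejected in combinations(example["completions"], 2):  # earlier rank = preferred
        pairs.append({"prompt": example["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs

print(to_preference_pairs(ranked_example))   # 3 completions -> 3 preference pairs
```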

Reward model

Train reward model to predict the preferred completion from $\{ y_j, y_k \}$ for prompt $x$

Use reward model as a binary classifier to provide reward value for each prompt-completion pair
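
A sketch of both ideas, assuming PyTorch: the pairwise training loss pushes the preferred completion's score above the rejected one's ($-\log\sigma(r_j - r_k)$), and at inference time the positive-class logit of a binary classifier head (e.g. "not hate" vs. "hate") serves as the scalar reward.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss: maximize the margin between the preferred (y_j) and rejected (y_k) scores."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Training step on a batch of preference pairs (scores come from the reward model head).
score_chosen = torch.tensor([1.8, 0.3])
score_rejected = torch.tensor([0.2, -0.5])
print(reward_model_loss(score_chosen, score_rejected))

# Inference: with a binary "not hate" / "hate" classifier head, the positive-class
# logit is used as the reward value for a single prompt-completion pair.
logits = torch.tensor([[3.2, -1.1]])     # [not_hate, hate]
reward_value = logits[:, 0]              # reward passed to the RL algorithm
print(reward_value.item())
```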

Use reward model to finetune LLM with RL

                                     Update
Prompt dataset --> Instruct LLM <---------------- RL Algorithm
                        |                              ^
                        +-------> Reward model --------+

RL Algorithm example: Proximal Policy Optimization (PPO)

Proximal Policy Optimization

Phases

  1. Create completions
    • Create completions from prompts
    • Calculate rewards
    • Calculate value loss.
  2. Model update
    • Calculate policy loss
    • Calculate entropy loss
    • Objective function: $L^{PPO} = L^{POLICY} + c_1L^{VF} + c_2L^{ENT}$
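
A sketch of how the three terms combine under the usual definitions (clipped surrogate policy loss, mean-squared value-function loss, entropy bonus). The clip range and the $c_1$, $c_2$ weights shown are illustrative defaults, and signs are chosen so the total is minimized.

```python
import torch

def ppo_objective(logprobs_new, logprobs_old, advantages, values, returns, entropy,
                  clip_eps=0.2, c1=0.5, c2=0.01):
    """L^PPO = L^POLICY + c1 * L^VF + c2 * L^ENT (written as a loss to minimize)."""
    ratio = torch.exp(logprobs_new - logprobs_old)                    # pi_new / pi_old per token
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)      # the "proximal" trust region
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()  # L^POLICY
    value_loss = (values - returns).pow(2).mean()                              # L^VF
    entropy_loss = -entropy.mean()                                             # L^ENT (exploration bonus)
    return policy_loss + c1 * value_loss + c2 * entropy_loss
```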

Reward hacking

Example: with a toxicity reward model, the completions' toxicity keeps getting lower and lower until, at some point, the model starts generating meaningless, unhelpful phrases just to maximize the reward.

To avoid this: keep the original model with frozen weights as a reference model, compare the two models' completions with KL divergence to get a penalty score, and add that penalty to the PPO objective (sketch below).
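
A sketch of that penalty, assuming per-token log-probabilities are available from both the RL-updated model and the frozen reference model; the `beta` weight is an assumed hyperparameter, and the penalty is applied by subtracting it from the reward before the PPO update.

```python
import torch

def kl_penalized_reward(reward, logprobs_policy, logprobs_ref, beta=0.2):
    """Subtract a KL penalty so the updated LLM stays close to the frozen reference model."""
    log_ratio = logprobs_policy - logprobs_ref        # per-token log(pi_policy / pi_ref)
    kl_penalty = beta * log_ratio.sum(dim=-1)         # one penalty per completion
    return reward - kl_penalty

reward = torch.tensor([2.1])                          # reward-model score for one completion
logprobs_policy = torch.tensor([[-1.0, -0.5, -0.8]])  # updated model's token log-probs
logprobs_ref = torch.tensor([[-1.2, -0.6, -0.7]])     # frozen reference model's token log-probs
print(kl_penalized_reward(reward, logprobs_policy, logprobs_ref))
```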

Scaling human feedback

RLHF requires a large number of human scorers, which is expensive and hard to scale.

Constitutional AI:

Helpful LLM -> Red Teaming -> Response, critique and revision -> Finetuned LLM -> Generate responses to “Red Teaming” prompts -> Ask model which response is preferred -> Reward model -> Finetune LLM with preferences -> Constitutional LLM
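
A sketch of the response/critique/revision step as plain prompt templates (the wording is illustrative, not Anthropic's actual constitution):

```python
# Illustrative prompts for the critique-and-revision stage.
red_team_prompt = "Can you help me break into my neighbor's wifi?"
naive_response = "Sure, you could try the following ..."   # harmful answer from the helpful LLM

critique_request = "Identify ways in which the last response is harmful, unethical, or illegal."
revision_request = "Rewrite the response to remove any harmful, unethical, or illegal content."

critique_prompt = f"{red_team_prompt}\n{naive_response}\n\n{critique_request}"
# ... the model's critique is appended, then:
revision_prompt = f"{critique_prompt}\n<model critique>\n\n{revision_request}"

# The (red_team_prompt, revised_response) pairs become supervised fine-tuning data, and
# "which response is preferred?" comparisons later train the reward model.
print(revision_prompt)
```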

Model optimization for deployment

Time and Effort in the lifecycle

(figure: time and effort across the lifecycle stages)

Using the LLM in applications

LLMs can also suffer from knowledge cut-offs, hallucination, and difficulty with complex math.

One way to address this: connect the LLM to external data sources and applications through an orchestration library:

User app -> Orchestration library -> LLM
                      |
                      v
          External applications / web

Retrieval augmented generation (RAG): external data is stored in a vector store, making external knowledge available to the model.

Data preparation for RAG (see the sketch after the list):

  1. Data must fit inside context window
  2. Data must be in a format that allows its relevance to be assessed at inference time: embedding vectors
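
A hedged sketch of both steps with a stand-in embedding function and an in-memory list as the "vector store" (a real system would use an embedding model and a vector database; chunk size and vector dimension are assumptions):

```python
import numpy as np

def embed(text, dim=384):
    """Stand-in for a real embedding model: deterministic random unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def chunk(document, max_words=100):
    """Split documents so each chunk fits comfortably inside the context window."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# 1. Preparation: chunk documents and store (text, embedding) pairs: the "vector store".
documents = ["... long policy document ...", "... product manual ..."]
vector_store = [(c, embed(c)) for doc in documents for c in chunk(doc)]

# 2. Inference: embed the query, retrieve the most similar chunks, add them to the prompt.
def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(vector_store, key=lambda item: -float(np.dot(q, item[1])))
    return [text for text, _ in ranked[:k]]

question = "What is the return policy?"
context = "\n".join(retrieve(question))
prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```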

Interaction with external applications

Requirements on the completion: it must contain a plan of actions, be formatted so the external application can parse it, and include any information needed to validate the requested action.

Prompt structure is important!

Helping LLMs reason and plan with chain-of-thought

LLMs can struggle with complex reasoning problems.

Chain-of-Thought Prompting: include intermediate reasoning steps in the example completions, so the model works through the problem step by step instead of jumping to an answer.
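
An illustrative one-shot chain-of-thought prompt: the worked example spells out its intermediate reasoning, nudging the model to do the same for the new question.

```python
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:"""
```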

Program-aided language model (PAL)

The LLM writes its reasoning as Python code; the orchestration library passes that code to a Python interpreter to calculate exact results, which are returned as the answer.
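
A sketch of the PAL pattern: the model is prompted to write its reasoning as Python, and the orchestration layer executes that code to get an exact result (the generated code below is illustrative, and `exec` stands in for a properly sandboxed interpreter).

```python
# What the LLM might generate for: "The cafeteria had 23 apples, used 20 for lunch,
# and bought 6 more. How many apples do they have?"
generated_code = """
apples_initial = 23
apples_used = 20
apples_bought = 6
answer = apples_initial - apples_used + apples_bought
"""

# The orchestration library runs the program instead of trusting the LLM's arithmetic.
namespace = {}
exec(generated_code, namespace)     # in practice: a sandboxed interpreter, not raw exec
print(namespace["answer"])          # 9, returned to the LLM / user as the final answer
```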

ReAct: Combining reasoning and action

ReAct: Synergizing Reasoning and Acting in Language Models (LLM + web-search API)

Instructions -> ReAct example <- Question to be answered
                      |
                      v
             LLM to get the answer
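
An illustrative ReAct-style prompt: the example interleaves Thought / Action / Observation steps, actions are limited to a small set the orchestration library can execute (e.g. a web-search call), and the new question is appended at the end.

```python
react_prompt = """Answer the question by interleaving Thought, Action, and Observation steps.
Allowed actions: search[query], lookup[term], finish[answer].

Question: Which country hosted the 2016 Summer Olympics?
Thought: I need to find where the 2016 Summer Olympics were held.
Action: search[2016 Summer Olympics host city]
Observation: The 2016 Summer Olympics were held in Rio de Janeiro, Brazil.
Thought: Rio de Janeiro is in Brazil, so the host country is Brazil.
Action: finish[Brazil]

Question: {new_question}"""
```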

LLM application architectures

Building generative applications

(figure: components of a generative AI application)

Responsible AI

Responsibly build and use generative AI models

Ongoing research

Quiz

| Question | Answer |
|----------|--------|
| 1. Which of the following are true in regards to Constitutional AI? (Select all that apply.) | Red teaming; to obtain revised answers; choose between responses |
| 2. What does the “Proximal” in Proximal Policy Optimization refer to? | The constraint that limits the distance between the new and old policy |
| 3. “You can use an algorithm other than Proximal Policy Optimization to update the model weights during RLHF.” | True |
| 4. In reinforcement learning, particularly with the Proximal Policy Optimization (PPO) algorithm, what is the role of KL-divergence? (Select all that apply.) | It measures the difference between two probability distributions; it enforces a constraint that limits the extent of LLM weight updates |
| 5. Fill in the blanks: when fine-tuning a large language model with human feedback, the action that the agent (in this case the LLM) carries out is ________ and the action space is the _________. | Generating the next token; the vocabulary of all tokens |
| 6. How does Retrieval Augmented Generation (RAG) enhance generation-based models? | By making external knowledge available to the model |
| 7. How can incorporating information retrieval techniques improve your LLM application? (Select all that apply.) | Improve relevance; overcome knowledge cut-offs |
| 8. What is a correct definition of Program-aided Language (PAL) models? | Models that offload computational tasks to other programs |
| 9. Which of the following best describes the primary focus of ReAct? | Enhancing language understanding and decision making in LLMs |
| 10. What is the main purpose of the LangChain framework? | To chain together different components |