Models may behave badly:
Instruct fine-tuned LLM -> RLHF -> Human-aligned LLM

        +-------------> Agent ---------------+
        |                                    |
        | state s, reward r                  | action a
        |                                    v
        +------------ Environment <----------+
Human role: scorer — rank or score the model's completions to build the feedback dataset.
Collect human feedback:
Prepare labeled data for training:
Train reward model to predict the preferred completion from a pair $\{y_j, y_k\}$ for prompt $x$ (pairwise-loss sketch below)
Use reward model as a binary classifier to provide reward value for each prompt-completion pair
Update the LLM weights with an RL algorithm to maximize the reward
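A minimal sketch of the pairwise reward-model loss: the model learns to score the human-preferred completion $y_j$ higher than the rejected one $y_k$ (the `reward_model` callable is a placeholder for whatever scoring network is trained, not course code):

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, preferred, rejected):
    """Pairwise loss: push the score of the preferred completion y_j
    above the score of the rejected completion y_k for the same prompt x."""
    r_j = reward_model(prompt, preferred)   # scalar score for y_j
    r_k = reward_model(prompt, rejected)    # scalar score for y_k
    # -log(sigmoid(r_j - r_k)) is minimized when r_j >> r_k
    return -F.logsigmoid(r_j - r_k).mean()
```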
Prompt dataset -> Instruct LLM <---- weight updates ---- RL Algorithm
                       |                                      ^
                       +------> Reward model ---- reward -----+
RL Algorithm example: Proximal Policy Optimization (PPO)
Phases: Phase 1 — generate completions and compute rewards; Phase 2 — update the model, keeping the new policy within a small ("proximal") region of the old one.
Reward hacking example: with a toxicity reward model, completions get lower and lower toxicity scores until the model starts generating meaningless, useless phrases just to maximize the reward.
To avoid this: keep the original model with frozen weights as a reference model, compute the KL divergence between the reference and updated policies' token distributions as a penalty, and add that penalty to the PPO objective.
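A sketch of how the KL penalty and the clipped PPO objective fit together; the tensor names (per-token log-probabilities from the updated, old, and frozen reference policies) are illustrative, not from the course code:

```python
import torch

def kl_penalty(logprobs_new, logprobs_ref):
    """Sample-based estimate of the KL divergence between the updated policy
    and the frozen reference model, penalizing drift away from it."""
    return (logprobs_new - logprobs_ref).mean()

def ppo_clipped_objective(logprobs_new, logprobs_old, advantages, eps=0.2):
    """The 'proximal' part: the probability ratio is clipped to [1-eps, 1+eps],
    limiting how far each update can move from the old policy."""
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def rlhf_loss(logprobs_new, logprobs_old, logprobs_ref, advantages, kl_coef=0.1):
    # total loss = clipped PPO objective + KL penalty against the reference model
    return (ppo_clipped_objective(logprobs_new, logprobs_old, advantages)
            + kl_coef * kl_penalty(logprobs_new, logprobs_ref))
```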
Drawback: RLHF requires a lot of human scorers.
Constitutional AI:
Helpful LLM -> red-teaming prompts -> response, critique and revision (guided by constitutional principles) -> fine-tuned LLM -> generate response pairs to red-teaming prompts -> ask the model which response is preferred -> reward model -> fine-tune LLM with preferences -> constitutional LLM
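A rough sketch of the critique-and-revision stage, assuming a hypothetical `generate(prompt)` helper that calls the LLM; the constitutional principles shown are illustrative:

```python
CONSTITUTION = [
    "Identify ways in which the last response is harmful, unethical, or misleading.",
    "Rewrite the response to remove the harmful content while staying helpful.",
]

def critique_and_revise(generate, red_team_prompt):
    """Red-team response -> self-critique -> revised response.
    The (prompt, revision) pairs become supervised fine-tuning data."""
    response = generate(red_team_prompt)
    critique = generate(f"Response: {response}\n{CONSTITUTION[0]}")
    revision = generate(
        f"Response: {response}\nCritique: {critique}\n{CONSTITUTION[1]}"
    )
    return red_team_prompt, revision
```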
LLMs can also suffer from knowledge cut-offs, hallucination, and weak math skills.
One way to address this: connect the LLM to external data sources and applications at inference time.
User app -> orchestration library -> LLM
                    |
                    v
        external applications / web
Retrieval augmented generation (RAG): external data (e.g. documents embedded in a vector store) is retrieved and added to the prompt, making external knowledge available to the model.
Data preparation for RAG:
Requirements: the data must fit inside the context window (split long documents into chunks), and it must be stored in a format that lets relevance be computed at query time (embedding vectors in a vector store).
Prompt structure is important!
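A minimal end-to-end RAG sketch with a toy in-memory vector store; `embed()` stands in for any embedding model and the chunk size is an arbitrary example value:

```python
import numpy as np

def chunk(text, size=500):
    """Split a long document so each piece fits inside the context window."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(chunks, embed):
    # embed() maps a string to a 1-D numpy vector (any embedding model works)
    return [(c, embed(c)) for c in chunks]

def retrieve(query, index, embed, k=3):
    """Rank chunks by cosine similarity to the query embedding."""
    q = embed(query)
    scored = [(c, float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))))
              for c, v in index]
    return [c for c, _ in sorted(scored, key=lambda s: -s[1])[:k]]

def build_prompt(query, index, embed):
    """Prompt structure: retrieved context first, then the question."""
    context = "\n\n".join(retrieve(query, index, embed))
    return f"Use the following context to answer.\n\n{context}\n\nQuestion: {query}\nAnswer:"
```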
LLMs can struggle with complex reasoning problems.
Chain-of-Thought Prompting: include examples with intermediate reasoning steps in the prompt, so the model works through the problem step by step before answering.
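An illustrative one-shot chain-of-thought prompt in the style of the original CoT paper; the worked example supplies the intermediate reasoning the model is expected to imitate:

```python
COT_PROMPT = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11.
The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:"""  # the model should now reason step by step before giving its answer
```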
Program-aided language models (PAL): the LLM writes code, and the orchestration library passes it to a Python interpreter to calculate exact results.
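A sketch of the PAL pattern, assuming a hypothetical `generate(prompt)` helper; the LLM is asked to emit Python code and the orchestrator runs it (real applications should sandbox model-generated code):

```python
def pal_answer(generate, question):
    """Ask the LLM to solve the problem as Python code, then execute the code
    with a Python interpreter to get an exact result."""
    code = generate(
        "Write Python code that computes the answer to the question "
        f"and stores it in a variable named `answer`.\nQuestion: {question}"
    )
    namespace = {}
    exec(code, namespace)  # in a real app, run this in a sandbox
    return namespace.get("answer")
```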
ReAct: Synergizing Reasoning and Acting in LLMs (LLM + web-search API)
Instructions + ReAct example + question to be answered
                       |
                       v
              LLM generates the answer
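A simplified ReAct loop, assuming a hypothetical `generate()` for the LLM and a `search()` function standing in for the web-search API; real ReAct prompts also include full Thought/Action/Observation worked examples:

```python
def react(generate, search, question, max_steps=5):
    """Interleave reasoning ("Thought") with tool calls ("Action": Search[...])
    until the model emits a final answer ("Finish[...]")."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(transcript + "Thought:")
        transcript += f"Thought:{step}\n"
        if "Finish[" in step:                       # final answer reached
            return step.split("Finish[", 1)[1].split("]", 1)[0]
        if "Search[" in step:                       # tool call requested
            query = step.split("Search[", 1)[1].split("]", 1)[0]
            transcript += f"Observation: {search(query)}\n"
    return None  # no answer within the step budget
```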
Building generative applications
Responsibly build and use generative AI models
Question | Answer |
---|---|
1. Which of the following are true in regards to Constitutional AI? Select all that apply. | Red Teaming & to obtain revised answers & the model chooses between responses |
2. What does the “Proximal” in Proximal Policy Optimization refer to? | The constraint that limits the distance between the new and old policy |
3. “You can use an algorithm other than Proximal Policy Optimization to update the model weights during RLHF.” | True |
4. In reinforcement learning, particularly with the Proximal Policy Optimization (PPO) algorithm, what is the role of KL-Divergence? Select all that apply. | measures the difference between two probability distributions & enforce a constraint that limits the extent of LLM weight updates |
5. Fill in the blanks: When fine-tuning a large language model with human feedback, the action that the agent (in this case the LLM) carries out is ________ and the action space is the _________. | Generating next tokens, vocabulary of all tokens |
6. How does Retrieval Augmented Generation (RAG) enhance generation-based models? | By making external knowledge available to the model |
7. How can incorporating information retrieval techniques improve your LLM application? Select all that apply. | Improve relevance & Overcome knowledge cut-offs |
8. What is a correct definition of Program-aided Language (PAL) models? | Models that offload computational tasks to other programs |
9. Which of the following best describes the primary focus of ReAct? | Enhancing language understanding and decision making in LLMs |
10. What is the main purpose of the LangChain framework? | To chain together different components |