Unifying RLHF Objectives
Reinforcement learning from human feedback (RLHF) teaches language models to optimize for human preferences, rather than the supervised (perplexity-minimizing) objective from pretraining. It does so by collecting a dataset of language model outputs and having humans rate which output is better ("Do you prefer answer A or B?"). Here, I describe several commonly used RLHF algorithms in terms of their gradients.
Consider the problem of optimizing a language model $\pi_\theta(y \mid x)$, which assigns a probability to a completion $y$ given a prompt $x$. We can view different RLHF algorithms through the gradient of their loss functions, sketched below.
Intuitively, these algorithms typically increase the probability of the chosen completion, and decrease the probability of the rejected completion.
Different algorithms are differentiated by their choice of weights on these two terms.
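As a rough sketch (notation assumed here: $x$ is the prompt, $y^{+}$ the chosen completion, $y^{-}$ the rejected one, and $w^{+}, w^{-}$ the per-example weights), most of the losses below have gradients of the form

$$
\nabla_\theta \mathcal{L} \;=\; -\,\mathbb{E}_{(x,\, y^{+},\, y^{-})}\Big[\; w^{+}\, \nabla_\theta \log \pi_\theta(y^{+} \mid x) \;-\; w^{-}\, \nabla_\theta \log \pi_\theta(y^{-} \mid x) \;\Big].
$$

Supervised methods keep only the first term, unpaired methods set each weight from that completion's own reward or advantage, and paired methods compute both weights from a comparison between the two completions.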
Summary
Supervised (weight only on the chosen completion, e.g. SFT and C-RLFT)
Unpaired (increase or decrease each completion on its own, based on its reward or advantage):
- Vanilla Policy Gradient (VPG)
- Proximal Policy Optimization (PPO)
- Advantage-Induced Policy Alignment (APA)
- Kahneman-Tversky Optimization (KTO)
Paired (push the chosen completion up and the rejected one down, based on a comparison between the two):
- Unlikelihood
- Reward Modeling (RM)
- Direct Preference Optimization (DPO)
- Rank Responses to Align Human Feedback (RRHF) / SLiC
Note that unpaired methods may also have negative weights, i.e. they push a completion down when its reward or advantage is negative.
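To make this concrete, here are the standard textbook forms of two of these weights, in assumed notation ($A(x, y)$ is a reward or advantage estimate, $\delta$ a margin hyperparameter); exact constants and normalizations vary by implementation.

$$
\text{VPG:}\qquad \nabla_\theta \mathcal{L} \;=\; -\,\mathbb{E}\big[\, A(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big], \qquad \text{i.e. } w = A(x, y).
$$

$$
\text{SLiC-style hinge:}\qquad \mathcal{L} \;=\; \mathbb{E}\Big[\max\big(0,\; \delta - \log \pi_\theta(y^{+} \mid x) + \log \pi_\theta(y^{-} \mid x)\big)\Big], \qquad \text{i.e. } w^{+} = w^{-} = \mathbf{1}[\text{margin violated}].
$$

VPG weights each sampled completion by its own (possibly negative) advantage, while the hinge only produces a gradient when the pair is ranked incorrectly or within the margin, which is why it requires paired data.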
PPO derivation
I only include the derivations for PPO and RM as illustrative examples.
PPO starts from a policy $\pi_\theta$, an old (behavior) policy $\pi_{\mathrm{old}}$, and the clipped surrogate objective. The $\min$ with the clipped ratio immediately sets the derivative equal to zero whenever the clipped term is selected, i.e. when the probability ratio has moved outside the clipping range in the direction that would further increase the objective.
Then, take the derivative for the other (unclipped) case. We use the “policy gradient trick” from the chain rule, $\nabla_\theta\, \pi_\theta(y \mid x) = \pi_\theta(y \mid x)\, \nabla_\theta \log \pi_\theta(y \mid x)$, to express the gradient of the probability ratio in terms of $\nabla_\theta \log \pi_\theta(y \mid x)$; the result is sketched below.
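For reference, here is a sketch of the standard clipped-surrogate objective and its per-example gradient, in assumed notation ($\rho$ is the probability ratio, $A$ the advantage, $\epsilon$ the clip range):

$$
\mathcal{L}_{\mathrm{PPO}} \;=\; -\,\mathbb{E}\Big[\min\big(\rho\, A,\; \mathrm{clip}(\rho,\, 1-\epsilon,\, 1+\epsilon)\, A\big)\Big], \qquad \rho = \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{old}}(y \mid x)},
$$

$$
\nabla_\theta \mathcal{L}_{\mathrm{PPO}} \;=\;
\begin{cases}
0 & \text{if the clipped term is selected by the } \min, \\
-\,\mathbb{E}\big[\, \rho\, A\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big] & \text{otherwise.}
\end{cases}
$$

In the unified view above, the PPO weight is therefore $w = \rho A$ while the ratio is unclipped, and $0$ once the clip is active.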
One can perform a similar derivation for the negative-advantage case, where the completion is pushed down rather than up.
We can see that, compared to APA, PPO maintains a positive weight proportional to the advantage right up until the ratio is clipped, at which point the gradient drops abruptly to zero rather than decaying smoothly.
Reward modeling derivation
In this case, we are considering only the task of training the reward model $r_\phi(x, y)$, not the policy itself.
Using the Bradley-Terry model for pairwise comparisons (where the probability that $y^{+}$ is preferred over $y^{-}$ is modeled as $\sigma\!\big(r_\phi(x, y^{+}) - r_\phi(x, y^{-})\big)$), the loss is the negative log-likelihood of the observed preferences.
We utilize two useful properties of the sigmoid function: $1 - \sigma(z) = \sigma(-z)$, and $\nabla_z\, \sigma(z) = \sigma(z)\,\big(1 - \sigma(z)\big)$. Applying them through the chain rule yields the gradient sketched below, which completes the derivation with weight $\sigma\!\big(r_\phi(x, y^{-}) - r_\phi(x, y^{+})\big)$, i.e. the modeled probability of the wrong ranking.
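Written out in assumed notation, the Bradley-Terry loss and its gradient are:

$$
\mathcal{L}_{\mathrm{RM}} \;=\; -\,\mathbb{E}_{(x,\, y^{+},\, y^{-})}\Big[\log \sigma\!\big(r_\phi(x, y^{+}) - r_\phi(x, y^{-})\big)\Big],
$$

$$
\nabla_\phi \mathcal{L}_{\mathrm{RM}} \;=\; -\,\mathbb{E}\Big[\, \sigma\!\big(r_\phi(x, y^{-}) - r_\phi(x, y^{+})\big)\, \big(\nabla_\phi r_\phi(x, y^{+}) - \nabla_\phi r_\phi(x, y^{-})\big) \,\Big].
$$

The more confidently the reward model currently gets a comparison wrong, the larger the weight; confidently correct pairs contribute almost nothing to the update.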
We can see DPO has a very similar formulation to APA: both aim to softly increase the chosen completion's log-probability relative to the reference model (and decrease the rejected one's), with the update shrinking as the desired margin is approached.
In contrast, C-RLFT / Decision Transformer-style methods do not “push down” the rejected (low-reward) completions at all; they only up-weight, or condition on, the preferred ones.
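For comparison, DPO's gradient takes the standard form from its paper, with $\pi_{\mathrm{ref}}$ the frozen reference policy, $\beta$ the temperature, and $\hat{r}_\theta$ the implicit reward:

$$
\nabla_\theta \mathcal{L}_{\mathrm{DPO}} \;=\; -\,\beta\, \mathbb{E}\Big[\, \sigma\!\big(\hat{r}_\theta(x, y^{-}) - \hat{r}_\theta(x, y^{+})\big)\, \big(\nabla_\theta \log \pi_\theta(y^{+} \mid x) - \nabla_\theta \log \pi_\theta(y^{-} \mid x)\big) \,\Big], \qquad \hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.
$$

The weight is exactly the reward-modeling weight above, but computed from the policy's own implicit reward $\hat{r}_\theta$, which is why the update saturates once the implicit margin between chosen and rejected is large enough.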
Commentary
I chose the above methods because they have been used to train top models on the Chatbot Arena benchmark:
- SFT is present in many models, including Hermes
- PPO is used in top closed-source models such as ChatGPT and Gemini
- C-RLFT is used in OpenChat, the top 7B model as of Feb 2024 (used to initialize Starling)
- APA is the final stage of Starling, which builds on OpenChat
- DPO is popular in the open-source community, but performs relatively poorly in Chatbot Arena, with its best 7B model being Zephyr
- The authors of RRHF went on to build Qwen, which at the time of writing is the top open-source model on the leaderboard
Ultimately, these objective functions are conceptually very similar, and all of them perform well once tuned; the real power is in the dataset (and how it is weighted). The more interesting question is which kinds of datasets each objective lets you train on: offline vs. online, paired vs. unpaired.
Offline vs online training
Open-source methods typically use GPT-4 outputs as training data, and GPT-4 itself has already undergone online RL optimization; thus, they can seemingly get away with offline-only training. However, at the time of writing, some form of online optimization still seems necessary somewhere in the pipeline, and we continue to see online PPO used to train the world's largest closed-source models.
Paired vs unpaired data
DPO only trains on paired data; while this enables its simplicity, it is also a limitation. On Hugging Face, open-source preference datasets for coding, math, and reasoning are scarce. Evaluating Zephyr (DPO) on MT-Bench, we see it performs worse in these categories than comparable open-source 7B models that were not constrained to preference pairs.
It is relatively easy to obtain preference data by using the GPT-4 API as a judge over your own datasets, but real human preferences will likely remain scarce compared to large closed-source labelling efforts (and finetuning data is always limited compared to pretraining data). Also, by definition, unpaired datasets will always contain at least as much data as paired ones: every pair can be split into unpaired examples, but not vice versa. Thus, unpaired methods are likely more flexible.
Paired reward model training
I think there’s an interesting question here: if the paired preference dataset is lacking in some task (say, math or coding), how do you still train a good reward model for that task?
One idea might be that, conditioned on the task
Unpaired training
Another approach is simply the C-RLFT-style method, where the model “remembers” everything: it trains on both good and bad completions, conditioned on a coarse quality label, rather than discarding or pushing down the bad ones.
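A minimal sketch of this kind of conditioned objective, in assumed notation ($c$ is a coarse quality or source label and $w_c$ a per-class weight; this is the spirit of C-RLFT rather than its exact formulation):

$$
\mathcal{L}_{\text{conditioned}} \;=\; -\,\mathbb{E}_{(x,\, y,\, c)}\big[\, w_c\, \log \pi_\theta(y \mid x, c) \,\big].
$$

At inference time, one simply conditions on the highest-quality label, so no completion ever receives a negative weight.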
Anyway, there are a lot of moving pieces required for successful RLHF; I think this is just one interesting perspective on some of them.