Unifying RLHF Objectives

What are different RL algorithms actually doing?

Published January 2024

Reinforcement learning from human feedback (RLHF) tries to teach language models to optimize for human preferences, rather than the supervised perplexity from pretraining. It does so by collecting a dataset of language model outputs and having humans rate which output is better (Do you prefer answer A or B?). Here, I describe different commonly used RLHF algorithms in terms of the gradients of their loss functions.

Consider the problem of optimizing a language model $π_{θ}$ from a preference dataset consisting of a context $x$ and two completions: the chosen completion $y_{+}$ and the rejected completion $y_{-}$. These represent, e.g., two possible responses from a chatbot, with the preferred one chosen by a human.

We can view different RLHF algorithms by considering the gradient of their loss function:

$$- \nabla_{θ} L (π_{θ}) = w_{+} \nabla_{θ} \log π_{θ} (y_{+} | x) - w_{-} \nabla_{θ} \log π_{θ} (y_{-} | x)$$

Intuitively, these algorithms typically increase the probability of the chosen completion and decrease the probability of the rejected completion. They differ in their choice of $w_{+}$ and $w_{-}$ (for methods that do not operate on paired data, simply take $w_{-} = 0$). They may also use a reward function $r (x, y)$ representing the “Elo” of the full completion. Note that simplifications are made for ease of comparison.
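To make the gradient form above concrete, here is a minimal sketch under strong assumptions: the "policy" is a softmax over three candidate completions for a single context, parameterized directly by its logits. The toy policy, the helper names, and the step size are all illustrative and not part of any of the algorithms discussed.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, y):
    """Gradient of log pi(y) w.r.t. the logits of a softmax policy: onehot(y) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == y else 0.0) - p for i, p in enumerate(probs)]

def update_direction(logits, y_pos, y_neg, w_pos, w_neg):
    """-grad L = w_pos * grad log pi(y_pos) - w_neg * grad log pi(y_neg)."""
    g_pos = grad_log_prob(logits, y_pos)
    g_neg = grad_log_prob(logits, y_neg)
    return [w_pos * gp - w_neg * gn for gp, gn in zip(g_pos, g_neg)]

# Toy setup (assumed): three candidate completions, uniform initial policy.
logits = [0.0, 0.0, 0.0]
y_pos, y_neg = 0, 1

# One ascent step with w_+ = w_- = 1 (the "Unlikelihood" row of the summary).
step = update_direction(logits, y_pos, y_neg, w_pos=1.0, w_neg=1.0)
new_logits = [z + 0.5 * s for z, s in zip(logits, step)]

before, after = softmax(logits), softmax(new_logits)
print("pi(y+):", before[y_pos], "->", after[y_pos])
print("pi(y-):", before[y_neg], "->", after[y_neg])
```

In this toy setting the step raises the probability of the chosen completion and lowers that of the rejected one; the methods below differ only in how they set the two weights.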

Summary

Supervised (weight on $\nabla_{θ} \log π_{θ}$ is always positive):

Supervised Finetuning (SFT): $w_{+} = 1$
Conditioned-RL Finetuning (C-RLFT): $w_{+} = 1, w_{-} \approx - 0.1$

Unpaired (increase $w_{+}$ in proportion to $r (x, y_{+})$; assume $r (x, y_{+}) > 0$ for clarity):

Vanilla Policy Gradient (VPG): $w_{+} = r (x, y_{+})$
Proximal Policy Optimization (PPO): $w_{+} = \begin{cases} \frac{π_{θ} (y_{+} | x)}{π_{ref} (y_{+} | x)} \cdot r (x, y_{+}) & \text{if } \frac{π_{θ} (y_{+} | x)}{π_{ref} (y_{+} | x)} < 1 + ϵ \\ 0 & \text{otherwise} \end{cases}$
Advantage-Induced Policy Alignment (APA): $w_{+} = r (x, y_{+}) - \log \frac{π_{θ} (y_{+} | x)}{π_{ref} (y_{+} | x)} = r (x, y_{+}) - {\hat{r}}_{θ} (x, y_{+})$
Kahneman-Tversky Optimization (KTO): $w_{+} = σ^{'} (r (x, y_{+}) - E [\log \frac{π_{θ} (y | x)}{π_{ref} (y | x)}])$

Paired (push $y_{+}$ and $y_{-}$ apart):

Unlikelihood: $w_{+} = w_{-} = 1$
Reward Modeling (RM): $w_{+} = w_{-} = σ (r_{θ} (x, y_{-}) - r_{θ} (x, y_{+}))$
Direct Preference Optimization (DPO): $w_{+} = w_{-} = σ ({\hat{r}}_{θ} (x, y_{-}) - {\hat{r}}_{θ} (x, y_{+}))$
Rank Responses to Align Human Feedback (RRHF) / SLiC: $w_{+} = 1 + w_{-}, \; w_{-} = \mathbb{1} \left\{ \frac{π_{θ} (y_{+} | x)}{π_{θ} (y_{-} | x)} < β \right\}$

Note that unpaired methods may also have negative weights when $r (x, y_{+}) < 0$. Thus, we can think of them as dynamically choosing which samples should receive negative weight, unlike the paired methods, which set it directly from the dataset.
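The summary can be read as a lookup from method to a pair $(w_{+}, w_{-})$. The sketch below encodes several of the rows as a plain function over scalar inputs. The function name `weights` and its argument names are my own, `log_ratio_pos` and `log_ratio_neg` stand for the implicit reward ${\hat{r}}_{θ}$ of each completion, and C-RLFT, KTO, and RRHF are left out because they need extra inputs (a conditioning scheme, an expectation over samples, and a ratio between completions, respectively).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weights(method, r_pos=0.0, r_neg=0.0,
            log_ratio_pos=0.0, log_ratio_neg=0.0, eps=0.2):
    """Return (w_plus, w_minus) for one preference pair, following the summary.

    r_pos, r_neg: rewards r(x, y+) and r(x, y-).
    log_ratio_*:  log pi_theta(y|x) - log pi_ref(y|x) for each completion.
    """
    if method == "sft":
        return 1.0, 0.0
    if method == "vpg":
        return r_pos, 0.0
    if method == "ppo":
        ratio = math.exp(log_ratio_pos)
        return (ratio * r_pos if ratio < 1.0 + eps else 0.0), 0.0
    if method == "apa":
        return r_pos - log_ratio_pos, 0.0
    if method == "unlikelihood":
        return 1.0, 1.0
    if method == "rm":
        w = sigmoid(r_neg - r_pos)
        return w, w
    if method == "dpo":
        w = sigmoid(log_ratio_neg - log_ratio_pos)
        return w, w
    raise ValueError(f"unknown method: {method}")

# Illustrative inputs only; the numbers are arbitrary.
for name in ("sft", "vpg", "ppo", "apa", "unlikelihood", "rm", "dpo"):
    print(name, weights(name, r_pos=1.0, r_neg=-0.5,
                        log_ratio_pos=0.1, log_ratio_neg=-0.1))
```

Written this way, the unpaired rows always return $w_{-} = 0$, while the paired rows return the same value for both weights.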

PPO derivation

I only include the derivations for PPO and RM as illustrative examples.

PPO starts from a reference policy $π_{ref}$, the policy at the beginning of training, which generates the dataset used for training. It enforces a KL divergence constraint $\mathrm{KL} (π_{θ} \| π_{ref})$ to ensure that samples $y_{+} \sim π_{θ} (\cdot | x)$ do not diverge too much from the data used to train the reward model. It does this by maximizing:

$$- L (π_{θ}) = \min \left( \frac{π_{θ} (y_{+} | x)}{π_{ref} (y_{+} | x)}, 1 + ϵ \right) \cdot r (x, y_{+})$$

which immediately sets the gradient to zero when $\frac{π_{θ} (y_{+} | x)}{π_{ref} (y_{+} | x)} > 1 + ϵ$, since the clipped value $1 + ϵ$ does not depend on $θ$.

Then, take the derivative for the other case:

$$- \nabla_{θ} L (π_{θ}) = \frac{1}{π_{ref} (y_{+} | x)} \cdot r (x, y_{+}) \cdot \nabla_{θ} π_{θ} (y_{+} | x)$$

We use the “policy gradient trick” from the chain rule, $\nabla_{x} f (x) = f (x) \nabla_{x} \log f (x)$, applied to $f = π_{θ} (y_{+} | x)$, which yields the final gradient weight:

$$w_{+} = \begin{cases} \frac{π_{θ} (y_{+} | x)}{π_{ref} (y_{+} | x)} \cdot r (x, y_{+}) & \text{if } \frac{π_{θ} (y_{+} | x)}{π_{ref} (y_{+} | x)} < 1 + ϵ \\ 0 & \text{otherwise} \end{cases}$$

One can perform a similar derivation for the $1 - ϵ$ side of the PPO surrogate objective. This one-sided derivation is not exactly right, but captures the spirit of the maximization.
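As a sanity check on the one-sided derivation, the sketch below compares the weight $w_{+}$ against a finite-difference gradient of the clipped objective. It assumes a deliberately tiny policy that is not part of the original setup: a single scalar parameter `theta` with $π_{θ} (y_{+} | x) = σ (θ)$, and a positive reward so that the $\min$ behaves as described.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def objective(theta, theta_ref, r, eps):
    """One-sided clipped objective: min(pi_theta / pi_ref, 1 + eps) * r."""
    ratio = sigmoid(theta) / sigmoid(theta_ref)
    return min(ratio, 1.0 + eps) * r

def analytic_grad(theta, theta_ref, r, eps):
    """w_plus * d/dtheta log pi_theta(y+ | x), with w_plus as in the derivation."""
    ratio = sigmoid(theta) / sigmoid(theta_ref)
    w_pos = ratio * r if ratio < 1.0 + eps else 0.0
    grad_log_pi = sigmoid(-theta)  # d/dtheta log sigmoid(theta)
    return w_pos * grad_log_pi

def numeric_grad(theta, theta_ref, r, eps, h=1e-6):
    """Central finite difference of the objective."""
    up = objective(theta + h, theta_ref, r, eps)
    down = objective(theta - h, theta_ref, r, eps)
    return (up - down) / (2.0 * h)

theta_ref, r, eps = 0.0, 2.0, 0.2  # arbitrary illustrative values, r > 0

# theta = 0.1 keeps the ratio below 1 + eps; theta = 1.0 is in the clipped region.
for theta in (0.1, 1.0):
    print(theta,
          analytic_grad(theta, theta_ref, r, eps),
          numeric_grad(theta, theta_ref, r, eps))
```

The two gradients should agree in both regions, with the clipped case giving zero.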

We can see that, compared to APA, PPO maintains a positive $w_{+}$ until the $1 + ϵ$ ratio is hit, enforcing the KL constraint irrespective of $r$, while APA has positive $w_{+}$ until the log-ratio equals the reward.
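A crude way to see this difference is to simulate where each weight stops pushing. The sketch below makes a heavy simplification that is my own and not part of either algorithm: it treats the log-ratio $\log \frac{π_{θ} (y_{+} | x)}{π_{ref} (y_{+} | x)}$ as a single scalar that is updated directly by `lr * w_pos`, starting from $π_{θ} = π_{ref}$. The function name, learning rate, and step count are arbitrary.

```python
import math

def final_log_ratio(method, r, eps=0.2, lr=0.05, steps=500):
    """Iterate log_ratio += lr * w_plus and return where it ends up."""
    log_ratio = 0.0  # start at pi_theta == pi_ref
    for _ in range(steps):
        if method == "ppo":
            ratio = math.exp(log_ratio)
            w_pos = ratio * r if ratio < 1.0 + eps else 0.0
        elif method == "apa":
            w_pos = r - log_ratio
        else:
            raise ValueError(f"unknown method: {method}")
        log_ratio += lr * w_pos
    return log_ratio

for r in (0.1, 1.0):
    ppo = final_log_ratio("ppo", r)
    apa = final_log_ratio("apa", r)
    print(f"r={r}: PPO ratio={math.exp(ppo):.3f}, APA log-ratio={apa:.3f}")
```

Under these assumptions, PPO stops just past the $1 + ϵ$ ratio for both rewards, while APA settles at a log-ratio equal to the reward.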

Reward modeling derivation

In this case, we consider only the task of training the reward model $r_{θ} (x, \cdot)$ from preference data; thus, we take the derivative with respect to the parameters of the reward model $r_{θ}$, rather than those of the policy $π_{θ}$.

Using the Bradley-Terry model for pairwise comparisons (where $r_{θ}$ can be interpreted as an “Elo” rating), we optimize the objective:

$$- L (r_{θ}) = \log p_{θ} (y_{+} \succ y_{-} | x) = \log σ (r_{θ} (x, y_{+}) - r_{θ} (x, y_{-}))$$

We utilize some useful properties of the sigmoid function:

(1) $σ (x) = 1 - σ (- x)$
(2) $\nabla_{x} σ (x) = σ (x) (1 - σ (x)) = σ (x) σ (- x)$ (by applying (1))
(3) $\nabla_{x} \log σ (x) = σ (- x)$ (by applying the chain rule and (2))
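Spelled out, the last property is a single application of the chain rule followed by the second property:

$$\nabla_{x} \log σ (x) = \frac{\nabla_{x} σ (x)}{σ (x)} = \frac{σ (x) σ (- x)}{σ (x)} = σ (- x)$$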

This yields:

$$\begin{aligned} - \nabla_{θ} L (r_{θ}) & = \nabla_{θ} \log σ (r_{θ} (x, y_{+}) - r_{θ} (x, y_{-})) \\ & = σ (r_{θ} (x, y_{-}) - r_{θ} (x, y_{+})) \nabla_{θ} (r_{θ} (x, y_{+}) - r_{θ} (x, y_{-})) \\ & = σ (r_{θ} (x, y_{-}) - r_{θ} (x, y_{+})) (\nabla_{θ} r_{θ} (x, y_{+}) - \nabla_{θ} r_{θ} (x, y_{-})) \end{aligned}$$

which completes the derivation with $w_{+} = w_{-} = σ (r_{θ} (x, y_{-}) - r_{θ} (x, y_{+}))$. DPO follows a similar derivation using its implicit reward ${\hat{r}}_{θ} (x, y) = \log \frac{π_{θ} (y | x)}{π_{ref} (y | x)}$, which intuitively means the policy $π_{θ}$ “values” $y$ in proportion to its log-probability relative to the reference policy.
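The reward-model gradient can be checked numerically in the same spirit. The sketch below assumes a hypothetical linear reward model $r_{θ} (x, y) = θ \cdot φ (x, y)$ with made-up feature vectors, so that $\nabla_{θ} r_{θ} (x, y) = φ (x, y)$; none of these names come from the methods above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(theta, feats):
    """Hypothetical linear reward model: r_theta(x, y) = theta . phi(x, y)."""
    return sum(t * f for t, f in zip(theta, feats))

def log_likelihood(theta, feats_pos, feats_neg):
    """-L(r_theta) = log sigmoid(r(x, y+) - r(x, y-))."""
    return math.log(sigmoid(reward(theta, feats_pos) - reward(theta, feats_neg)))

def analytic_grad(theta, feats_pos, feats_neg):
    """w * (grad r(x, y+) - grad r(x, y-)) with w = sigmoid(r(x, y-) - r(x, y+))."""
    w = sigmoid(reward(theta, feats_neg) - reward(theta, feats_pos))
    return [w * fp - w * fn for fp, fn in zip(feats_pos, feats_neg)]

def numeric_grad(theta, feats_pos, feats_neg, h=1e-6):
    """Central finite difference, one coordinate at a time."""
    grads = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        diff = log_likelihood(up, feats_pos, feats_neg) - log_likelihood(down, feats_pos, feats_neg)
        grads.append(diff / (2.0 * h))
    return grads

# Arbitrary illustrative parameters and features.
theta = [0.3, -0.2]
feats_pos, feats_neg = [1.0, 0.5], [0.2, 1.5]

print(analytic_grad(theta, feats_pos, feats_neg))
print(numeric_grad(theta, feats_pos, feats_neg))
```

The two printed gradients should match to within finite-difference error, with the same weight multiplying both the chosen and rejected terms.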

We can see DPO has a very similar formulation to APA: both aim to softly increase $π_{θ} (y_{+} | x)$ until $π_{θ} (y_{+} | x) = e^{r (x, y_{+})} \cdot π_{ref} (y_{+} | x)$. This is also very similar to the PPO objective, except that PPO applies a hard clip once the ratio $\frac{π_{θ} (y_{+} | x)}{π_{ref} (y_{+} | x)}$ exceeds $1 + ϵ$. RRHF also uses a hard clip, but replaces $π_{ref} (y_{+} | x)$ with $π_{θ} (y_{-} | x)$, ensuring $π_{θ} (y_{+} | x) \geq π_{θ} (y_{-} | x)$.

In contrast, C-RLFT / Decision Transformer-style methods do not “push down” the rejected completion through the $w_{-}$ term; rather, they condition on some notion of negative reward. Therefore, the suboptimal behavior is still in the model, but must be elicited via a negative prompt.

Commentary

I chose the above methods because they have been used to train top models on the Chatbot Arena benchmark:

SFT is present in many models, including Hermes
PPO is used in top pretrained foundation models such as ChatGPT and Gemini
C-RLFT is used in OpenChat, the top 7B model as of Feb 2024 (used to initialize Starling)
APA is the final stage of Starling, which builds on OpenChat
DPO is popular in the open-source community, but performs relatively poorly in Chatbot Arena, with its best 7B model being Zephyr
The authors of RRHF went on to build Qwen, which at the time of writing is the top open-source model on the leaderboard

Ultimately, their objective functions are conceptually very similar and all perform well after tuning; the real power is in the dataset (and how it is weighted).
