Decision Transformer
In this work (arXiv), we studied how offline reinforcement learning can be framed as conditional sequence modeling, the same approach behind language models such as GPT.
Can RL algorithms be replaced with transformer-based language models? We’ve looked at this question with our work on Decision Transformer:
Website: https://t.co/MBeQ1OPO4V
Code: https://t.co/Z71ycSMsFh
— Igor Mordatch (@IMordatch) June 2, 2021
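Concretely, the approach flattens each logged trajectory into a sequence of (return-to-go, state, action) tokens and trains a causal transformer to predict the action tokens, exactly like next-token prediction in a language model. The sketch below is a minimal, illustrative version of that data-preparation step (plain NumPy; the function names are mine, not the released code):

```python
import numpy as np

def returns_to_go(rewards):
    """Return-to-go at step t is the sum of rewards from t to the end of the episode."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

def to_training_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples into one flat sequence.
    A causal transformer is then trained to predict each action token from the
    tokens that precede it."""
    rtg = returns_to_go(rewards)
    return [(rtg[t], states[t], actions[t]) for t in range(len(rewards))]
```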
Example
The simplest way to think of Decision Transformer now (in the ChatGPT era) is as a chatbot:
- You start by saying: “You are an expert in X” or “You have an IQ of 140” (the target return).
- You then provide context to the situation, or give it a question (its observation).
- The model then responds to your question (its action).
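At test time this analogy becomes a simple loop: pick a target return, feed the current observation, read off the predicted action, and subtract whatever reward was actually collected from the target. A rough sketch of that loop, assuming a gym-style `env` and a hypothetical `model.predict_action` helper:

```python
def rollout(model, env, target_return, max_steps=1000):
    """Roll out a return-conditioned sequence model.

    `model.predict_action(context, rtg, state)` is a hypothetical helper that feeds
    the running (return-to-go, state, action) history plus the current state to the
    model and returns the next action.
    """
    state = env.reset()
    context = []               # running (return-to-go, state, action) history
    rtg = target_return        # the "you are an expert" knob
    for _ in range(max_steps):
        action = model.predict_action(context, rtg, state)
        next_state, reward, done, info = env.step(action)
        context.append((rtg, state, action))
        rtg -= reward          # reward already collected no longer needs to be earned
        state = next_state
        if done:
            break
    return context
```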
This turned out to be a general emergent phenomenon of large models trained on internet data:
Language-conditional models can act a bit like decision transformers, in that you can prompt them with a desired level of "reward".
E.g., want prettier #dalle creations? "Just ask" by adding "[very]^n beautiful":
n=0: "A beautiful painting of a mountain next to a waterfall."
— Phillip Isola (@phillip_isola) June 2, 2022
Here, we have:
- Target return: “very” ^ n
- Observation: the text prompt
- Action: the model’s generated image
Then for n=22: “A very very very very very very very very very very very very very very very very very very very very very very beautiful painting of a mountain next to a waterfall.”
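As a toy illustration, the prompts in this experiment can be generated mechanically (the helper below is purely hypothetical):

```python
def beauty_prompt(n):
    """Prepend n copies of "very" as a crude text-space 'target return'."""
    return f"A {'very ' * n}beautiful painting of a mountain next to a waterfall."

print(beauty_prompt(0))   # the n=0 prompt from the tweet
print(beauty_prompt(22))  # the n=22 prompt quoted above
```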
Comparison to imitation learning
One of the crucial questions for us was whether to train on all the data (and condition on expertise) or only on the high-quality expert subset; we named the latter approach “percentile behavior cloning” (%BC).
It turns out that on the popular D4RL robotics benchmark, %BC was sufficient to achieve state-of-the-art performance! However, in data-scarce settings such as the Atari benchmark, performance suffered drastically when data was thrown out.
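Conceptually, %BC is just a return-based filter placed in front of ordinary behavior cloning. A minimal sketch, assuming trajectories are stored as dicts with a "rewards" array (the field name is illustrative):

```python
import numpy as np

def percentile_bc_filter(trajectories, keep_fraction=0.10):
    """Keep only the top `keep_fraction` of trajectories by episode return;
    ordinary behavior cloning is then run on the surviving subset (%BC)."""
    returns = np.array([np.sum(traj["rewards"]) for traj in trajectories])
    cutoff = np.quantile(returns, 1.0 - keep_fraction)
    return [traj for traj, ret in zip(trajectories, returns) if ret >= cutoff]
```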
I think this is still a really interesting question, and it remains an active area of research, as large foundation models are pretrained on vast amounts of low-quality data and then finetuned on high-quality subsets.
Follow-up work
There have been a lot of interesting papers in this area; to highlight just a few:
- Sep 2023 – OpenChat: conditions on either “expert” GPT-4 or “low-quality” GPT-3.5 data
- Oct 2022 – Algorithm Distillation: learns new tasks in-context via prompting
- Aug 2022 – Stable Diffusion: conditions on an aesthetic score (also: CLIP-DT)
Supplementary links
Unaffiliated links:
- Dec 2021 – The Batch
- Jun 2021 – Yannic Kilcher
- Jun 2021 – SyncedReview
- Jun 2021 – The Gradient