Pretrained Transformers as Universal Computation Engines
In this work (arXiv), we found that pretraining on language data can lead to nontrivially better performance on non-language tasks – an early sign of the multimodality inherent in large language models.
What are the limits to the generalization of large pretrained transformer models?
We find minimal fine-tuning (~0.1% of params) performs as well as training from scratch on a completely new modality!
with @_kevinlu, @adityagrover_, @pabbeel
paper: https://t.co/DtWGJ0Afh7
1/8
— Igor Mordatch (@IMordatch) March 10, 2021
Multimodal transfer
Our main result was that you can take GPT-2 (~100M parameters) and adapt it to a new modality by finetuning only the input layer, output layer, and layer-norm parameters. This means the bulk of the model (the attention and feedforward layers) transfers to the new modality unchanged, and still achieves good performance without any finetuning!
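As a rough illustration, here is a minimal sketch of this recipe, assuming PyTorch and the Hugging Face `transformers` GPT-2 implementation (neither is prescribed by the post); the input dimension, number of classes, and class name are placeholders for whatever the new modality needs:

```python
import torch
import torch.nn as nn
from transformers import GPT2Model


class FrozenPretrainedTransformer(nn.Module):
    """Reuse a pretrained GPT-2 on a new modality, training only a few parameters.

    The attention and feedforward weights stay frozen; only the new input
    projection, the new output head, and the layer-norm parameters are trained.
    """

    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")  # GPT-2 small, ~124M params

        # Freeze everything in the pretrained model except the layer norms
        # (parameter names containing "ln_": ln_1, ln_2, ln_f).
        for name, param in self.gpt2.named_parameters():
            param.requires_grad = "ln_" in name

        hidden = self.gpt2.config.n_embd            # 768 for GPT-2 small
        self.input_proj = nn.Linear(input_dim, hidden)      # new, trained
        self.output_head = nn.Linear(hidden, num_classes)   # new, trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, input_dim) from the new modality
        embeds = self.input_proj(x)
        hidden_states = self.gpt2(inputs_embeds=embeds).last_hidden_state
        return self.output_head(hidden_states[:, -1])  # predict from last position


model = FrozenPretrainedTransformer(input_dim=16, num_classes=10)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable / total:.2%} of parameters")
```

With small input and output dimensions like these, the trainable fraction lands well under 1% of the model, in the same ballpark as the ~0.1% figure quoted in the tweet above.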
One implication is that you might think of the language model as having learned capabilities similar to the hippocampus: acting as a general, multimodal sequence processor. Perhaps you could then use the pretrained language model as the base for future adapters built on top of it, performing reasoning in the space of language. Our initial result was far from this, but later work has pushed this capability much further, with LLMs such as GPT-4 and Gemini serving as the base for powerful multimodal capabilities.
Follow-up work
- Apr 2023 – LLaVA: Imbuing a language model with highly performant vision capabilities
- Mar 2023 – PaLM-E: Turning a large language model into an embodied multimodal model
- Jun 2021 – Frozen: No longer necessary to finetune the layer-norm parameters
Supplementary links
Unaffiliated links:
- Jul 2021 – The Batch
- Mar 2021 – Yannic Kilcher
- Mar 2021 – VentureBeat