Encoders and decoders
Now, I’d like to briefly introduce you to two key topics that you’ll see throughout the discussion of transformer-based models: encoders and decoders. Let’s establish some basic intuition to help you understand what they are all about. An encoder is simply a computational graph (or neural network, function, or object, depending on your background) that takes an input from a larger feature space and returns an object in a smaller feature space. We hope, and demonstrate computationally, that the encoder learns what is most essential about the provided input data.
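To make this concrete, here is a minimal sketch of an encoder as a function that compresses a larger feature space into a smaller one. It is not taken from any model discussed here; the use of PyTorch and the layer sizes are illustrative assumptions only.

```python
import torch
import torch.nn as nn

# A toy encoder: maps a 512-dimensional input down to a
# 64-dimensional latent representation (sizes are arbitrary).
encoder = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
)

x = torch.randn(8, 512)  # a batch of 8 inputs with 512 features each
z = encoder(x)           # compressed representation, shape (8, 64)
print(z.shape)           # torch.Size([8, 64])
```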
Typically, in large language and vision models, the encoder itself is composed of a number of multi-head self-attention blocks. This means that in a transformer-based model, an encoder is usually a stack of self-attention steps that learn what is most essential about the input data and pass this on to the downstream model. Let’s look at a quick visual:
Figure 1.1 – Encoders and decoders
Intuitively, as you can see in the preceding figure, the encoder starts with a larger input space and iteratively compresses it into a smaller latent space. In the case of classification, this is just a classification head with one output allotted to each class. In the case of masked language modeling, encoders are stacked on top of each other to better predict the tokens that replace the masks. This means the encoders output an embedding, or numerical representation, of each masked token, and after prediction, the tokenizer is reused to translate the predicted token back into natural language.
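As a quick hands-on illustration of masked language modeling with an encoder-only model, the sketch below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is prescribed by the text above.

```python
from transformers import pipeline

# Encoder-only masked language modeling: BERT's stacked encoders
# produce an embedding for the [MASK] position, and the tokenizer
# maps the predicted token back into natural language.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```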
One of the earliest large language models, BERT, is an encoder-only model. Most other models in the BERT family, such as DeBERTa, DistilBERT, and RoBERTa, also use encoder-only architectures. Decoders operate in exactly the reverse direction, starting with a compressed representation and iteratively expanding it back into a larger feature space. Encoders and decoders can also be combined, as in the original Transformer, to solve text-to-text problems such as machine translation.
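Conversely, a decoder-only model starts from a short prompt and expands it into a longer sequence. The following sketch again assumes the transformers library, with the gpt2 checkpoint chosen purely for illustration.

```python
from transformers import pipeline

# Decoder-only generation: GPT-2 starts from a compressed context
# (the prompt) and iteratively expands it into a longer sequence.
generator = pipeline("text-generation", model="gpt2")

result = generator("Large language models are", max_new_tokens=20)
print(result[0]["generated_text"])
```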
To make it easier, here’s a short table summarizing the three types of self-attention blocks we’ve looked at: encoders, decoders, and their combination.
| Size of inputs and outputs | Type of self-attention blocks | Machine learning tasks | Example models |
| --- | --- | --- | --- |
| Long to short | Encoder | Classification, any dense representation | BERT, DeBERTa, DistilBERT, RoBERTa, XLM, ALBERT, CLIP, VL-BERT, Vision Transformer |
| Short to long | Decoder | Generation, summarization, question answering, any sparse representation | GPT, GPT-2, GPT-Neo, GPT-J, ChatGPT, GPT-4, BLOOM, OPT |
| Equal | Encoder-decoder | Machine translation, style translation | T5, BART, BigBird, FLAN-T5, Stable Diffusion |
Table 1.3 – Encoders, decoders, and their combination
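Rounding out the table, an encoder-decoder model such as T5 handles text-to-text tasks like machine translation. The sketch below assumes the transformers library and the t5-small checkpoint, used here only as an example.

```python
from transformers import pipeline

# Encoder-decoder: the encoder compresses the English source sentence,
# and the decoder expands that representation into German text.
translator = pipeline("translation_en_to_de", model="t5-small")

result = translator("Encoders and decoders can be combined to translate text.")
print(result[0]["translation_text"])
```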
Now that you have a better understanding of encoders, decoders, and the models built from them, let’s close out the chapter with a quick recap of all the concepts you just learned about.