Understanding the 77-token limitation
Stable Diffusion v1.5 uses OpenAI's CLIP text encoder [2]. The CLIP text encoder accepts at most 77 tokens per input, including the begin-of-text and end-of-text markers, and this limit propagates to Stable Diffusion downstream. We can reproduce the 77-token limitation with the following steps:
- We can take the encoder out of Stable Diffusion and verify the limit directly. Suppose we have the prompt "a photo of a cat and a dog driving an aircraft" and repeat it 20 times so that the prompt’s token count exceeds 77:
prompt = "a photo of a cat and a dog driving an aircraft "*20
- Reuse the pipeline we initialized at the beginning of the chapter and take out tokenizer and text_encoder:
tokenizer = pipe.tokenizer
text_encoder = pipe.text_encoder
- Use tokenizer to get the token IDs from the prompt:
tokens = tokenizer(
    prompt,
    truncation = False,
    return_tensors = 'pt'
)["input_ids"...
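
To make the effect of the limit concrete without loading any model, here is a toy sketch of what a fixed 77-slot context window does to the repeated prompt. It is not the real CLIP tokenizer: toy_tokenize and truncate_to_window are hypothetical helpers introduced only for illustration, and splitting on whitespace is an assumption standing in for CLIP's byte-pair encoding, so the token count it reports will not necessarily match what pipe.tokenizer returns. The two special token IDs are the ones CLIP uses for its begin-of-text and end-of-text markers.

```python
# Toy model of a fixed-size context window; NOT the real CLIP tokenizer.
MAX_LENGTH = 77   # context length of the CLIP text encoder
BOS_ID = 49406    # CLIP <|startoftext|> token ID
EOS_ID = 49407    # CLIP <|endoftext|> token ID


def toy_tokenize(prompt):
    """Hypothetical stand-in for BPE: one fake ID per whitespace-separated word."""
    vocab = {}
    ids = [vocab.setdefault(word, len(vocab)) for word in prompt.split()]
    # Wrap the word IDs in the begin-of-text and end-of-text markers.
    return [BOS_ID] + ids + [EOS_ID]


def truncate_to_window(token_ids, max_length=MAX_LENGTH):
    """Keep the first max_length - 1 IDs and force the last slot to be EOS."""
    if len(token_ids) <= max_length:
        return token_ids
    return token_ids[: max_length - 1] + [EOS_ID]


prompt = "a photo of a cat and a dog driving an aircraft " * 20
full_ids = toy_tokenize(prompt)
kept_ids = truncate_to_window(full_ids)

print(len(full_ids))  # 222 in this toy: 220 words plus BOS and EOS
print(len(kept_ids))  # 77
```

In this toy, only the first 75 word IDs survive between the two markers; everything after that point is dropped, which is why text past the window cannot influence the encoder's output.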