Using Imagen with LangChain
When discussing multimodal tasks, we should mention Imagen, a text-to-image diffusion model developed by Google Research and Google Brain [3]. It is available on Vertex AI, and we can use it with LangChain.
Let’s start with a simple, funny image (available as a sample in a public GCS bucket):
Figure 6.1 – Cat Humor
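The snippets that follow reference this image through an `image_url` variable. A minimal sketch, assuming a hypothetical GCS path (the bucket and object names here are placeholders; substitute the actual URI of the sample image):

```python
# Hypothetical URI for the sample image above; replace with the real
# public GCS path (or any reachable image URL).
image_url = "gs://my-sample-bucket/cat_humor.png"
```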
Imagen on Vertex AI covers a few use cases, and each of them requires a slightly different model class in LangChain, with a different interface:
- Visual captioning, a common problem statement in ML, generates a text description from an image:
```python
from langchain_google_vertexai import VertexAIImageCaptioning

response = VertexAIImageCaptioning().invoke(image_url)
print(response)
```

>> a cat yawning with the caption wake up human
- Visual question answering (Q&A), or answering a question based on an image...