Summary
In this chapter, we discussed how to construct multimodal inputs in LangChain. We also learned how to use Google’s Imagen foundation model with LangChain for various image-related use cases, such as visual question answering and visual captioning.
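As a quick reminder of the pattern, here is a minimal sketch of a multimodal input in LangChain: a single `HumanMessage` whose content list mixes a text block and an image block. The file path and model name are illustrative, and any multimodal chat model wrapper could stand in for `ChatVertexAI`:

```python
import base64

from langchain_core.messages import HumanMessage
from langchain_google_vertexai import ChatVertexAI

# Read and base64-encode a local image (path is illustrative).
with open("street_scene.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# A multimodal message combines text and image blocks in one content list.
message = HumanMessage(
    content=[
        {"type": "text", "text": "What is happening in this picture?"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_b64}"},
        },
    ]
)

llm = ChatVertexAI(model_name="gemini-1.5-flash")
response = llm.invoke([message])
print(response.content)
```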
Then, we looked into multimodal RAG. We learned about multimodal embeddings and ways to extract images from raw documents. We also discussed the options for including images in a RAG pipeline, from finding the relevant images to adding them into the context.
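To recap the embedding step, the sketch below uses the Vertex AI SDK's multimodal embedding model to project an image and a text query into the same vector space, which is what lets a text query retrieve image passages at RAG time. The file path and the contextual text are illustrative; this assumes the `google-cloud-aiplatform` SDK is installed and configured:

```python
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Load the multimodal embedding model.
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

# Embed an image together with optional contextual text; both embeddings
# live in the same vector space, so a text query can retrieve images.
image = Image.load_from_file("figure_3.png")  # illustrative path
embeddings = model.get_embeddings(
    image=image,
    contextual_text="A bar chart of quarterly revenue",
)

image_vector = embeddings.image_embedding  # store this in the vector store
text_vector = embeddings.text_embedding
```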
Finally, we explored the parsers available in LangChain for image understanding, which are based on the Google Cloud Vision API.
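Those parsers wrap the Google Cloud Vision API. As an illustration of the underlying calls (not the exact LangChain wrapper), the sketch below runs label detection and OCR on a local image; the file path is illustrative and assumes Google Cloud credentials are configured:

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Load the image to annotate (path is illustrative).
with open("invoice.png", "rb") as f:
    image = vision.Image(content=f.read())

# Label detection gives a high-level description of the image content;
# text detection extracts any embedded text (OCR).
labels = client.label_detection(image=image).label_annotations
texts = client.text_detection(image=image).text_annotations

print([label.description for label in labels])
if texts:
    print(texts[0].description)  # the first annotation holds the full OCR text
```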
In the previous few chapters, we discussed various aspects of building RAG pipelines, deepened our understanding of how to use LLMs in various scenarios, and learned how to develop applications with LangChain. In the next chapter, we’ll look at other generative AI use cases. We’ll start with summarization, and we’ll talk about how to...