Multimodality with Gemini and LangChain
First, what does it mean for a model to be multimodal? It means the model supports more than text content as input (or output). For example, some models accept audio, PDFs, images, and other content types as input, or can generate images, audio, video, and so on.
As of September 2024, Google Cloud exposed the Gemini 1.0 Pro Vision, Gemini 1.5 Pro, and Gemini 1.5 Flash models, which allow text, images, audio, video, and PDFs to be included in prompts. Responses are in the form of text or code [2].
Unlike many other models, these models can seamlessly process images and video alongside text as part of their prompts. This enables them to understand visuals and answer questions about them. We will discuss potential use cases later in this chapter. If you need a model to generate text or answer questions based on text (or to perform other text-centric tasks), you can use Gemini 1.0 Pro or other models. If...
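To make the idea of mixed text-and-image prompting concrete, here is a minimal sketch of how such a prompt can be assembled in LangChain's content-list format, where a single message carries both a text part and a base64-encoded image part. The helper function name and the model name "gemini-1.5-flash" are illustrative assumptions, and the actual model call (which requires the langchain-google-vertexai package and configured Google Cloud credentials) is shown only in comments.

```python
import base64


def make_multimodal_content(question: str, image_bytes: bytes) -> list:
    """Build a mixed text+image content list in the shape LangChain
    chat models accept: a list of typed content parts."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {"type": "text", "text": question},
        {
            "type": "image_url",
            # Inline the image as a data URL; media type is assumed PNG here.
            "image_url": {"url": f"data:image/png;base64,{encoded}"},
        },
    ]


# Illustrative placeholder bytes; in practice, read a real image file.
content = make_multimodal_content(
    "What is shown in this image?", b"\x89PNG placeholder"
)

# With credentials configured, invoking Gemini could look like this
# (hypothetical usage, not executed here):
# from langchain_google_vertexai import ChatVertexAI
# from langchain_core.messages import HumanMessage
# llm = ChatVertexAI(model_name="gemini-1.5-flash")
# response = llm.invoke([HumanMessage(content=content)])
# print(response.content)
```

The key point is that the image travels inside the same message as the question, so the model reasons over both modalities in one prompt rather than in separate calls.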