Multimodality
Information can be presented in various modalities – for example, text, images, videos, audio, and so on. Typically, machine learning (ML) models deal with a single modality on each side: for example, a model might take a video as input and produce a text description of that video as output. Imagine how great it would be if you could ask a large language model (LLM) a question about a specific image. In that case, your input would become both text and an image (or perhaps only text, or only an image). Multimodality is the capability of an ML model to accept as input, or produce as output, data of several modalities at the same time (e.g., text, image, video, audio); a multimodal model can handle different modalities on the input side, the output side, or both. [1]
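To make this concrete, here is a minimal sketch of such a text-plus-image request, assuming the OpenAI Python SDK and a vision-capable model such as gpt-4o; the model name and image URL are placeholders, and any multimodal model API could be used in the same way.

```python
# A minimal sketch of a multimodal request: text and an image in one prompt.
# Assumes the OpenAI Python SDK (pip install openai) and the OPENAI_API_KEY
# environment variable; the model name and image URL are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # a vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                # The text part of the question...
                {"type": "text", "text": "What trend does this chart show?"},
                # ...and the image it refers to (placeholder URL)
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)  # a text answer about the image
```

The key point is that a single prompt now carries two modalities at once: the model reasons over the text and the image jointly rather than over either one in isolation.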
Our documents often contain multimodal content (a very simple example: images or charts in a document that carry important meaning). Imagine how much better our retrieval augmented generation...