Developing multimodal RAGs
In the previous chapter, we discussed a classic example of RAG: a Q&A application on enterprise data. Often, the source of the data is PDFs that contain important information in the form of images – pie charts, graphs, and other types of visualizations.
We have two problems in front of us. First, we have to determine how to extract images from the underlying documents. Second, once we have both text and images from a document, we need to know how to prepare the context for the LLM. Let’s look at these problems one by one. Of course, you can expand this approach to other types of content.
Extracting images from PDF documents
Ideally, we should have images as a separate source of data for our RAG applications. However, in practice, images are often part of PDF files and other unstructured data sources, and we’d like our Q&A application to take them into account. That means we need to extract them during the pre-processing stage.