Handling documents that contain a mix of text and tabular data
Data is not always simple. Many real-world documents, such as research papers and financial reports, contain a mix of unstructured text and structured data in tables. Ingesting such heterogeneous documents presents an additional challenge: we need not only to extract the text but also to identify, parse, and process the tables embedded within it. Sometimes you get tables, sometimes you get text, and sometimes you have to deal with a mix of both.
LlamaIndex provides UnstructuredElementNodeParser to tackle such documents, which contain free-form text as well as tables and other structured elements. It leverages the Unstructured library to analyze the document layout and delineate text sections from tables.
This parser works exclusively on HTML files and can extract two types of nodes:
- Text nodes: Containing the text chunks
- Table nodes: Containing the table data and metadata...
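To make the distinction between the two node types more concrete, here is a minimal sketch of the underlying idea using only the Python standard library. It is not how UnstructuredElementNodeParser or the Unstructured library is implemented; the names TextTableSplitter and split_html and the sample HTML are hypothetical, chosen purely for illustration. The sketch walks an HTML string and emits a "text" entry for each text chunk outside a table and a "table" entry, holding the rows and cells, for each table. It assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TextTableSplitter(HTMLParser):
    """Toy splitter (hypothetical): separates text chunks from tables in HTML."""

    def __init__(self):
        super().__init__()
        self.nodes = []     # ("text", str) or ("table", list of rows)
        self._rows = None   # rows of the table being read; None outside a table
        self._cell = None   # text pieces of the current cell; None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._rows.append([])
        elif tag in ("td", "th") and self._rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Close the cell and store its text in the current row.
            self._rows[-1].append(" ".join(self._cell))
            self._cell = None
        elif tag == "table" and self._rows is not None:
            # Close the table and emit it as a single table node.
            self.nodes.append(("table", self._rows))
            self._rows = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._cell is not None:
            self._cell.append(text)             # text inside a table cell
        elif self._rows is None:
            self.nodes.append(("text", text))   # free-form text outside tables


def split_html(html):
    """Return the list of text and table nodes found in an HTML string."""
    parser = TextTableSplitter()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Hypothetical sample document mixing paragraphs and one table.
sample = (
    "<p>Revenue grew in 2023.</p>"
    "<table><tr><th>Year</th><th>Revenue</th></tr>"
    "<tr><td>2022</td><td>10</td></tr>"
    "<tr><td>2023</td><td>12</td></tr></table>"
    "<p>Details follow.</p>"
)
nodes = split_html(sample)
for kind, content in nodes:
    print(kind, content)
```

Running the sketch on the sample yields a text node, a table node, and another text node, in document order. A real parser does considerably more than this toy version, but the output shape, text chunks on one side and table contents on the other, mirrors the two node types listed above.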