Parsing documents with Document AI
Instead of using PyPDF to load and parse PDF documents, we can use Document AI, a managed Google Cloud service that helps with “document processing and understanding” [4]. Document AI is recommended for scanned PDFs and for PDFs with complex layouts or a mix of images and text. It offers various processors for extracting data from PDF files (from generic PDF parsers to specialized expense or invoice parsers). LangChain integrates this Google Cloud service through the DocAIParser class:
from langchain_google_community import DocAIParser

parser = DocAIParser(
    location=LOCATION,
    processor_name=PROCESSOR_NAME,
    gcs_output_path=GCS_OUTPUT_PATH,
)
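The snippet above relies on three constants that it does not define. As a rough sketch, they might be assembled as shown below. All concrete values (the project number, processor ID, and bucket name) are hypothetical placeholders, and the helper names build_processor_name and parse_pdf are introduced here only for illustration. The sketch assumes that the processor is identified by its full resource name and that parsing is triggered by passing a Blob that points at a file in Cloud Storage to lazy_parse; check both against the version of langchain_google_community you have installed.

```python
# Hypothetical placeholder values; replace them with your own project details.
PROJECT_NUMBER = "123456789012"   # assumed: numeric Google Cloud project number
LOCATION = "us"                   # Document AI location of the processor, e.g. "us" or "eu"
PROCESSOR_ID = "0123456789abcdef" # assumed: ID shown on the processor's details page
GCS_OUTPUT_PATH = "gs://my-bucket/docai-output"  # hypothetical bucket and prefix


def build_processor_name(project_number: str, location: str, processor_id: str) -> str:
    """Build the full resource name that identifies a Document AI processor."""
    return f"projects/{project_number}/locations/{location}/processors/{processor_id}"


PROCESSOR_NAME = build_processor_name(PROJECT_NUMBER, LOCATION, PROCESSOR_ID)


def parse_pdf(gcs_uri: str):
    """Parse one PDF stored in Cloud Storage and return LangChain documents.

    Imports are kept local so the constants above can be inspected without
    installing the Google Cloud dependencies or configuring credentials.
    """
    from langchain_core.document_loaders.blob_loaders import Blob
    from langchain_google_community import DocAIParser

    parser = DocAIParser(
        location=LOCATION,
        processor_name=PROCESSOR_NAME,
        gcs_output_path=GCS_OUTPUT_PATH,
    )
    # Assumption: lazy_parse yields Document objects for the given blob.
    return list(parser.lazy_parse(Blob(path=gcs_uri)))
```

Calling parse_pdf("gs://my-bucket/input/report.pdf") (again a hypothetical path) needs valid Google Cloud credentials and an existing processor, so it is best treated as an outline of the call sequence rather than a drop-in implementation.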
Before using the class, you need to do the following:
- Create a Document AI processor (choose a pre-trained or custom one). You can read more in the documentation on how to create one (either programmatically...