Document chunking
LangChain’s TextSplitter interface breaks large documents into smaller chunks for processing. As discussed in the previous chapter, chunking is one of the patterns for developing RAG applications, since it helps keep the input context small. The choice of splitter depends on the document’s structure and your specific requirements.
RecursiveCharacterTextSplitter
A versatile option is RecursiveCharacterTextSplitter. It divides text at natural breaks such as paragraphs, sentences, and individual words:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # Adjust chunk size as needed
    chunk_overlap=200  # Optional overlap for context preservation
)
In this example, the splitter aims for chunks of approximately 1,000 characters, allowing for a 200-character overlap between...
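To make the idea of splitting on natural breaks more concrete, here is a minimal pure-Python sketch of recursive splitting. It is not LangChain's implementation: the function name recursive_split, the separator order, and the sample text are assumptions made for illustration, and chunk overlap is deliberately left out to keep the logic short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Illustrative sketch (not LangChain's code): split text into pieces of
    at most chunk_size characters, trying coarse separators before fine ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Keep packing parts into the current chunk while they fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text, chosen so the middle paragraph exceeds the limit.
sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
chunks = recursive_split(sample, 40)
for chunk in chunks:
    print(repr(chunk))
```

The short paragraphs stay intact, while the long middle paragraph falls through to word-level splitting. A production splitter such as RecursiveCharacterTextSplitter also handles overlap between chunks, which this sketch omits.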