Ideas for multilingual document chunking logic? #4
Isaac24Karat started this conversation in General
As part of my Agentic RAG system, I’ve been working on improving how the pipeline handles multilingual documents — especially long PDFs or scraped pages.
⚙️ Current setup:
Each user query is routed through agents (retriever → translator → verifier → synthesizer)
Documents can be in French, German, or Hebrew — sometimes mixed
I'm chunking using LangChain’s RecursiveCharacterTextSplitter + language detection before embedding
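
For context, the splitting step currently looks roughly like this (a minimal sketch; the `chunk_size`/`chunk_overlap` values and the `langdetect` error handling are illustrative, not my exact config):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langdetect import detect, LangDetectException

# Fixed-size recursive splitting; sizes are placeholders, not the real config
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)

def chunk_document(text: str) -> list[dict]:
    """Split raw text into chunks and tag each chunk with a detected language."""
    chunks = []
    for piece in splitter.split_text(text):
        try:
            lang = detect(piece)  # e.g. "fr", "de", "he"
        except LangDetectException:
            lang = "unknown"
        chunks.append({"text": piece, "lang": lang})
    return chunks
```

Since the splitter only looks at character counts and separator characters, it has no notion of sentence boundaries in any language, which is where the problems below come from.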
🧠 The problem:
Chunking rules don’t always respect sentence structure or paragraph logic in non-English text.
This results in:
Overlapping or broken thoughts
Irrelevant matches during retrieval
Translator agents getting context fragments that don’t make sense
✅ What I’m exploring:
Chunk by sentence instead of fixed char size (language-dependent; rough sketch after this list)
Run per-language tokenizers before splitting
Use metadata (like subtitles or tags) as section anchors
Train a domain-specific chunker for legal or travel documents
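
Here's a rough sketch of the sentence-first idea, assuming NLTK for sentence splitting (punkt ships French and German models; Hebrew isn't covered, so this falls back to a naive punctuation split; `max_chars` is a placeholder):

```python
import re
from nltk.tokenize import sent_tokenize  # needs the NLTK "punkt" models downloaded

# punkt has trained models for French and German; Hebrew is not among them,
# so fall back to a naive split on sentence-final punctuation.
_PUNKT_LANGS = {"fr": "french", "de": "german"}

def split_sentences(text: str, lang: str) -> list[str]:
    if lang in _PUNKT_LANGS:
        return sent_tokenize(text, language=_PUNKT_LANGS[lang])
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_by_sentence(text: str, lang: str, max_chars: int = 800) -> list[str]:
    """Pack whole sentences into chunks so no sentence is cut mid-thought."""
    chunks: list[str] = []
    current = ""
    for sent in split_sentences(text, lang):
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```

For mixed-language pages, the same approach could run per paragraph (detect each paragraph's language before choosing the tokenizer), and metadata anchors like subtitles or tags would simply become hard chunk boundaries on top of this.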
❓ What I’d love feedback on:
Have you dealt with multilingual chunking in RAG pipelines?
Do you prefer chunk → translate or translate → chunk?
Is there a tokenizer or splitting heuristic that has worked surprisingly well for non-English data?
Any advice or shared patterns appreciated 🙏
I’m especially curious about chunking for Hebrew, French, and mixed-language input.
Repo: agentic-rag-system