Ideas for multilingual document chunking logic? #4
Isaac24Karat started this conversation in General
As part of my Agentic RAG system, I’ve been working on improving how the pipeline handles multilingual documents — especially long PDFs or scraped pages.
⚙️ Current setup:
Each user query is routed through agents (retriever → translator → verifier → synthesizer)
Documents can be in French, German, or Hebrew — sometimes mixed
I'm chunking using LangChain’s RecursiveCharacterTextSplitter + language detection before embedding
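
For context, the splitting step currently looks roughly like this (a minimal sketch; the `chunk_size`/`chunk_overlap` values and the `langdetect` error handling are illustrative, not my exact config):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langdetect import detect, LangDetectException

# Fixed-size recursive splitting; sizes are placeholders, not the real config
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)

def chunk_document(text: str) -> list[dict]:
    """Split raw text into chunks and tag each chunk with a detected language."""
    chunks = []
    for piece in splitter.split_text(text):
        try:
            lang = detect(piece)  # e.g. "fr", "de", "he"
        except LangDetectException:
            lang = "unknown"
        chunks.append({"text": piece, "lang": lang})
    return chunks
```

Since the splitter only looks at character counts and separator characters, it has no notion of sentence boundaries in any language, which is where the problems below come from.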
🧠 The problem:
Chunking rules don’t always respect sentence structure or paragraph logic in non-English text.
This results in:
Overlapping or broken thoughts
Irrelevant matches during retrieval
Translator agents getting context fragments that don’t make sense
✅ What I’m exploring:
Chunk by sentence instead of fixed char size (language-dependent; rough sketch after this list)
Run per-language tokenizers before splitting
Use metadata (like subtitles or tags) as section anchors
Train a domain-specific chunker for legal or travel documents
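
Here's a rough sketch of the sentence-first idea, assuming NLTK for sentence splitting (punkt ships French and German models; Hebrew isn't covered, so this falls back to a naive punctuation split; `max_chars` is a placeholder):

```python
import re
from nltk.tokenize import sent_tokenize  # needs the NLTK "punkt" models downloaded

# punkt has trained models for French and German; Hebrew is not among them,
# so fall back to a naive split on sentence-final punctuation.
_PUNKT_LANGS = {"fr": "french", "de": "german"}

def split_sentences(text: str, lang: str) -> list[str]:
    if lang in _PUNKT_LANGS:
        return sent_tokenize(text, language=_PUNKT_LANGS[lang])
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_by_sentence(text: str, lang: str, max_chars: int = 800) -> list[str]:
    """Pack whole sentences into chunks so no sentence is cut mid-thought."""
    chunks: list[str] = []
    current = ""
    for sent in split_sentences(text, lang):
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```

For mixed-language pages, the same approach could run per paragraph (detect each paragraph's language before choosing the tokenizer), and metadata anchors like subtitles or tags would simply become hard chunk boundaries on top of this.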
❓ What I’d love feedback on:
Have you dealt with multilingual chunking in RAG pipelines?
Do you prefer chunk → translate or translate → chunk?
Is there a tokenizer or splitting heuristic that has worked surprisingly well for non-English data?
Any advice or shared patterns appreciated 🙏
I’m especially curious about chunking for Hebrew, French, and mixed-language input.
Repo: agentic-rag-system