Description
Bug
When using RapidOCR through Docling, specifying a custom latin rec_keys.txt file still results in incorrect Chinese characters being returned on French documents.
I tested this using the Torch engine with the PP-OCRv3 latin .pth files, but I also reproduced the issue using other engines — so the problem is not engine-specific.
After investigation, the issue comes from a mismatch between the parameter name Docling sends to RapidOCR and what RapidOCR actually expects.
Docling passes:
"Rec.keys_path": rec_keys_path
But RapidOCR expects:
"Rec.rec_keys_path"
(See RapidOCR code in rapidocr/ch_ppocr_rec/main.py, line 51.)
Because of this mismatch, RapidOCR ignores the provided latin keys file and instead falls back to the default dictionary inside the package:
ppocr_keys_v1.txt
This default file contains Chinese characters, which explains why Chinese characters appear in the OCR output even when using latin models.
I confirmed that manually overriding the parameter to "Rec.rec_keys_path" fixes the issue and yields correct latin OCR results.
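To illustrate why the mismatch fails silently, here is a minimal sketch of the lookup behaviour described above. It is a simplified model, not RapidOCR's actual code: `resolve_rec_keys` and the `latin_keys.txt` path are hypothetical names used only for this example, and the fallback-to-default logic is assumed from the behaviour I observed.

```python
# Simplified model of the behaviour described in this report.
# NOT RapidOCR's real implementation: resolve_rec_keys is a hypothetical helper.
DEFAULT_KEYS_FILE = "ppocr_keys_v1.txt"  # bundled default dictionary named above


def resolve_rec_keys(params: dict[str, str]) -> str:
    """Return the keys file a recognizer would use, given flat override params."""
    # Only the expected key is consulted; any other key is silently ignored.
    return params.get("Rec.rec_keys_path", DEFAULT_KEYS_FILE)


# "latin_keys.txt" is a placeholder path for illustration.
sent_by_docling = {"Rec.keys_path": "latin_keys.txt"}      # key Docling sends
manual_override = {"Rec.rec_keys_path": "latin_keys.txt"}  # key RapidOCR expects

print(resolve_rec_keys(sent_by_docling))  # falls back to the default dictionary
print(resolve_rec_keys(manual_override))  # uses the custom latin keys file
```

Because the unknown key raises no error, the only visible symptom is the wrong character set in the output.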
Steps to reproduce
- Use Docling with RapidOCR enabled.
- Use the Torch engine with PP-OCRv3 latin .pth recognition models (also reproducible with other engines).
- Provide a latin recognition keys file (rec_keys.txt or equivalent).
- Run OCR on a French document.
- Observe that the output contains Chinese characters.
- Override RapidOCR parameters with {"Rec.rec_keys_path": "<path_to_keys>"}.
- Run OCR again → latin characters now appear correctly.
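When reproducing, the "contains Chinese characters" check in the steps above can be automated rather than done by eye. The sketch below is one possible way to do that: `cjk_chars` is a hypothetical helper, not part of Docling or RapidOCR, and it only looks at the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which I am assuming is enough to catch this symptom. The sample strings are made up for illustration.

```python
def cjk_chars(text: str) -> list[str]:
    """Return every character in the CJK Unified Ideographs block (U+4E00-U+9FFF)."""
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]


# Made-up sample strings; real input would be the text exported from the OCR run.
expected_output = "Déjà vu à l'école"
broken_output = "Déjà vu \u7684 l'école"  # one stray Chinese character

print(cjk_chars(expected_output))
print(cjk_chars(broken_output))
```

An empty list after applying the override, and a non-empty one before it, would match the behaviour reported here.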
Docling version
2.63.0
Python version
3.12