Description
🚀 The feature
The "I" (vowel) symbol is systematically missing from the recognition model output when using a vertical image layout with the preserve_aspect_ratio option set to True.
Motivation, pitch
I have discovered that the model consistently struggles to recognize the vowel "I" in various positions within a sentence, particularly at the start and end, when the preserve_aspect_ratio parameter in the Pre-Processor is set to True.
Setting preserve_aspect_ratio=False helps mitigate the issue, but only for vertically oriented text.
However, with horizontally oriented text, setting preserve_aspect_ratio=True results in better recognition of "I" occurrences.
I have also tried changing the interpolation method in the image resize preprocessing stage. While this improves the overall recognition quality, it does not affect the "I"s.
Alternatives
I have three suggestions for fixing this issue:
- Use a conditional check to set the value of the preserve_aspect_ratio parameter based on the ratio of the image sides: detecting horizontal text would select True, and detecting vertical text would select False.
- Allow the preserve_aspect_ratio value to be chosen when calling the model.
- Add more horizontal text samples to the datasets and retrain/fine-tune the detection and recognition models.
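The first suggestion could look roughly like the sketch below. The helper name choose_preserve_aspect_ratio is hypothetical and not part of doctr, and treating square images as horizontal is my own assumption; it only illustrates the side-ratio check.

```python
def choose_preserve_aspect_ratio(height: int, width: int) -> bool:
    """Hypothetical helper (not a doctr API).

    Returns True for horizontal (landscape or square) images and
    False for vertical (portrait) images, following the first suggestion.
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    # Assumption: square images are treated as horizontal.
    return width >= height


# Example: a 400x100 (width x height) image is horizontal, a 100x400 one is vertical.
horizontal = choose_preserve_aspect_ratio(height=100, width=400)
vertical = choose_preserve_aspect_ratio(height=400, width=100)
```

The returned value could then be passed as preserve_aspect_ratio when building the predictor, at the cost of keeping one predictor per setting.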
Additional context
I have tested the issue with an image composed of different sentences containing I's on the latest version of the doctr library (torch).
The picture shows a comparison between two OCR runs on the same image using different values of the preserve_aspect_ratio parameter.
Meaning of the box colors:
- the blue color is assigned to preserve_aspect_ratio set to False outlier results;
- the red color is assigned to preserve_aspect_ratio set to True outlier results;
- gray color signifies there is no change between runs;
- other colors represent partial difference.
The two attached JSON files show the doctr output on the same image. In doctr_ocr_par_False there are 10 "I" occurrences, while in doctr_ocr_par_True there are 5 "I" occurrences.
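For anyone who wants to repeat the count, here is a minimal sketch. It assumes the exported dict follows the pages / blocks / lines / words nesting with a "value" field per word, and that an "occurrence" means every capital "I" character in the recognized words; count_symbol is a hypothetical helper, not a doctr function.

```python
def count_symbol(export: dict, symbol: str = "I") -> int:
    """Count how often `symbol` appears in the recognized word values.

    Assumes the nesting pages -> blocks -> lines -> words, each word
    carrying its text under the "value" key.
    """
    total = 0
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    total += word.get("value", "").count(symbol)
    return total


# Tiny hand-made example in the assumed structure (not real OCR output).
sample = {
    "pages": [
        {"blocks": [{"lines": [{"words": [{"value": "I"}, {"value": "am"}, {"value": "If"}]}]}]}
    ]
}
sample_count = count_symbol(sample)
```

The same function can be applied to each attached file after loading it with json.load.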


The basic script used for tests:
import io

import numpy as np
import torch
import uvicorn
from doctr.models import ocr_predictor
from fastapi import FastAPI, UploadFile, File
from PIL import Image

app = FastAPI()

DETECTION_MODEL = "db_resnet50"
RECOGNITION_MODEL = "crnn_mobilenet_v3_large"
# Toggle this value between runs to reproduce the difference.
PRESERVE_ASPECT_RATIO = False
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = ocr_predictor(
    pretrained=True,
    det_arch=DETECTION_MODEL,
    reco_arch=RECOGNITION_MODEL,
    assume_straight_pages=True,
    preserve_aspect_ratio=PRESERVE_ASPECT_RATIO,
    symmetric_pad=True,
).to(device=DEVICE)


@app.post("/ocr")
async def ocr(file: UploadFile = File(...)):
    # Read the uploaded bytes and release the upload handle.
    image = await file.read()
    await file.close()

    # The predictor expects a list of RGB pages as numpy arrays.
    image_pil = Image.open(io.BytesIO(image)).convert("RGB")
    doc = [np.asarray(image_pil)]

    result = model(doc)
    return result.export()


if __name__ == "__main__":
    # "main:app" assumes this script is saved as main.py.
    uvicorn.run("main:app", host="0.0.0.0", port=44556, reload=False)