"I" (vowel) symbol recognition #1989

@Tulenien

🚀 The feature

The "I" (vowel) symbol is systematically missing from the recognition model output when using a vertical image layout with the preserve_aspect_ratio option set to True.

Motivation, pitch

I have discovered that the model consistently struggles to recognize the vowel "I" in various positions within a sentence, particularly at the start and end, when the preserve_aspect_ratio parameter in the Pre-Processor is set to True.

Setting preserve_aspect_ratio=False helps mitigate the issue, but only for vertically oriented text.

However, for horizontally oriented text, setting preserve_aspect_ratio=True results in better recognition of "I" symbol occurrences.

I have also tried changing the interpolation method in the image resize preprocessing stage. While this improves the overall recognition quality, it does not affect the "I"s.

Alternatives

I have three suggestions for how this issue could be fixed:

  1. Use a conditional check to set the value of the preserve_aspect_ratio parameter based on the ratio of the image sides: detecting horizontal text would lead to using True, and vertical text to using False.

  2. Allow the preserve_aspect_ratio value to be chosen when calling the model.

  3. Add more horizontal text samples to the datasets and retrain/finetune the detection and recognition models.
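
The first suggestion could be sketched roughly as follows. This is only an illustration, not part of doctr: the helper names `should_preserve_aspect_ratio` and `pick_predictor` are hypothetical, and treating any image with width >= height as "horizontal" is my own assumption about where the threshold should be.

```python
from typing import Mapping, Sequence, TypeVar

T = TypeVar("T")


def should_preserve_aspect_ratio(height: int, width: int) -> bool:
    """Hypothetical helper: True for horizontal (landscape or square) images,
    False for vertical (portrait) ones. The 1:1 threshold is an assumption."""
    if height <= 0 or width <= 0:
        raise ValueError("image sides must be positive")
    return width >= height


def pick_predictor(image_shape: Sequence[int], predictors: Mapping[bool, T]) -> T:
    """Select one of two pre-built predictors, keyed by the
    preserve_aspect_ratio value each was created with.

    image_shape is expected in (height, width, ...) order, as returned by
    numpy.ndarray.shape for an image array.
    """
    height, width = image_shape[:2]
    return predictors[should_preserve_aspect_ratio(height, width)]
```

With this, one could build two predictors up front (one with preserve_aspect_ratio=True, one with False), store them in a dict keyed by that value, and call `pick_predictor(image.shape, predictors)` per request. The obvious cost is keeping two models in memory.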

Additional context

I have tested the issue with an image composed of different sentences containing "I"s, on the latest version of the doctr library (torch).

The picture compares two OCR runs on the same image using different values of the preserve_aspect_ratio parameter.

Meaning of the box colors:

  • blue marks outlier results from the run with preserve_aspect_ratio set to False;
  • red marks outlier results from the run with preserve_aspect_ratio set to True;
  • gray means there is no change between runs;
  • other colors represent a partial difference.

The two attached JSON files show the doctr output on the same image. In doctr_ocr_par_False there are 10 "I" occurrences, while in doctr_ocr_par_True there are 5 "I" occurrences.
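
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the occurrences could be counted. It assumes the exported dictionary follows the usual pages / blocks / lines / words nesting with the text stored under "value", and that an "occurrence" means every uppercase "I" character in any word; the function name `count_symbol` is hypothetical.

```python
def count_symbol(export: dict, symbol: str = "I") -> int:
    """Count how many times `symbol` appears across all recognized words
    in an exported OCR result (assumed pages/blocks/lines/words layout)."""
    total = 0
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    # Each word is assumed to carry its text under "value".
                    total += word.get("value", "").count(symbol)
    return total
```

Loading each attached file with `json.load` and passing the result to `count_symbol` should give the two counts to compare.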

doctr_ocr_par_False.json

doctr_ocr_par_True.json

The basic script used for tests:

from fastapi import FastAPI, UploadFile, File
import numpy as np
from doctr.models import ocr_predictor
from PIL import Image
import io
import torch
import uvicorn

app = FastAPI()

DETECTION_MODEL = "db_resnet50"
RECOGNITION_MODEL = "crnn_mobilenet_v3_large"
PRESERVE_ASPECT_RATIO = False

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = ocr_predictor(
    pretrained=True,
    det_arch=DETECTION_MODEL,
    reco_arch=RECOGNITION_MODEL,
    assume_straight_pages=True,
    preserve_aspect_ratio=PRESERVE_ASPECT_RATIO,
    symmetric_pad=True,
).to(device=DEVICE)

@app.post("/ocr")
async def ocr(file: UploadFile = File(...)):
    image = await file.read()
    await file.close()

    doc = []
    image_pil = Image.open(io.BytesIO(image)).convert("RGB")
    doc.append(np.asarray(image_pil))

    result = model(doc)
    return result.export()


if __name__ == "__main__":
    uvicorn.run("main:app", host="0.0.0.0", port=44556, reload=False)