Training data should include bullet-like characters #45

@wollmers

Description

Modern texts, especially business documents, contain bullet-like symbols, e.g. for lists. The middle dot is also used with some frequency. While the recognition results for eng and deu are nearly perfect on regular text, the results for these symbols are "random".

For the next release of the trained models, the training data should be improved to cover these symbols, and maybe other symbols as well.
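
One way to picture the request is a small sketch that prefixes existing training text lines with bullet-like characters. This is only an illustration: the `BULLETS` list and the `add_bullets` helper are hypothetical names, the choice of characters is an assumption, and nothing here reflects how the actual training text for the models is produced.

```python
import random

# Assumed set of bullet-like characters (bullet, middle dot, triangular
# bullet, white bullet, small black square). The issue itself only names
# "bullet-like symbols" and the middle dot.
BULLETS = ["\u2022", "\u00b7", "\u2023", "\u25e6", "\u25aa"]


def add_bullets(lines, bullets=BULLETS, seed=0):
    """Prefix each non-empty line with a randomly chosen bullet and a space."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    return [
        f"{rng.choice(bullets)} {line}" if line.strip() else line
        for line in lines
    ]


sample = add_bullets(["Trucks", "vans", "bicycles"])
for line in sample:
    print(line)
```

Lines produced this way could be mixed into ordinary training text so that the symbols appear in a realistic list context rather than in isolation.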

Test image:

bullets

Tesseract result with -l eng:

List of vehicles:
* Trucks
* vans
* bicycles
Liste von Fahrzeugen:
e Lastwagen
e Transporter
e Fahrrader

Result with -l deu:

List of vehicles:
« Trucks
« vans
+ bicycles
Liste von Fahrzeugen:
e Lastwagen
e Transporter
e Fahrräder
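
The inconsistency can be made explicit by pulling out the character each model put in front of the list items. The following is a minimal sketch that works on the two outputs quoted above; `leading_symbols` is a hypothetical helper, and it assumes a list item is any line that starts with a single character followed by a space.

```python
# Outputs copied from the two results above.
ENG_OUTPUT = """List of vehicles:
* Trucks
* vans
* bicycles
Liste von Fahrzeugen:
e Lastwagen
e Transporter
e Fahrrader
"""

DEU_OUTPUT = """List of vehicles:
« Trucks
« vans
+ bicycles
Liste von Fahrzeugen:
e Lastwagen
e Transporter
e Fahrräder
"""


def leading_symbols(text):
    """Return the first token of every line that looks like a list item.

    Assumption: a list item starts with exactly one character, then a space.
    """
    symbols = []
    for line in text.splitlines():
        head, sep, rest = line.partition(" ")
        if sep and len(head) == 1 and rest:
            symbols.append(head)
    return symbols


print("eng:", leading_symbols(ENG_OUTPUT))
print("deu:", leading_symbols(DEU_OUTPUT))
```

For the same six bullets in the image, eng yields `*` three times and `e` three times, while deu yields `«`, `«`, `+` and then `e` three times.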
