Hyphenated words #22

@jbarth-ubhd

Dear reader,
does keraslm-rate take hyphenated words into account?

I tested with this demo file: https://digi.ub.uni-heidelberg.de/diglitData/v/keraslm/test-fouche10,5-s1.pdf

Many of the low-rated words seem to be line-end fragments that end in a hyphen:

With hyphenation:

# median: 0.962098 0.622701 ; mean: 0.948695 0.625144, correlation: 0.315179
# OCR-D-OCR OCR-D-KERAS
0.693236 0.410939  # region0002_line0021_word0003 daf3
0.927003 0.468318  # region0002_line0029_word0006 Rä-
0.932888 0.480686  # region0002_line0021_word0002 Lyon,
0.904642 0.484226  # region0002_line0032_word0001 Kerker.
0.909297 0.484817  # region0002_line0032_word0004 klaubt
0.931271 0.489822  # region0002_line0000_word0005 pas-
0.928169 0.491138  # region0000_line0004_word0007 sozia-
0.927566 0.492916  # region0002_line0014_word0003 Pythia;
0.958217 0.494058  # region0000_line0002_word0003 Lyon,
0.963757 0.494978  # region0003_line0001_word0005 Lyon,
0.926153 0.495819  # region0003_line0000_word0004 Kon-
0.960306 0.496031  # region0002_line0010_word0007 Lyon
0.911557 0.496326  # region0002_line0001_word0004 Rousseaus
0.967390 0.496934  # region0000_line0011_word0003 1792
0.929831 0.497394  # region0002_line0004_word0003 im
0.960453 0.498529  # region0002_line0017_word0006 Lyon
0.910209 0.499826  # region0002_line0018_word0002 Instinktiv
...

With hyphenation manually removed:

# median: 0.962198 0.623943 ; mean: 0.949162 0.628181, correlation: 0.278264
# OCR-D-OCRNOHYP OCR-D-KERNOHYP
0.693236 0.411037  # region0002_line0021_word0003 daf3
0.932888 0.480686  # region0002_line0021_word0002 Lyon,
0.904642 0.484226  # region0002_line0032_word0001 Kerker.
0.909297 0.484817  # region0002_line0032_word0004 klaubt
0.927566 0.492916  # region0002_line0014_word0003 Pythia;
0.958217 0.494058  # region0000_line0002_word0003 Lyon,
0.963757 0.494945  # region0003_line0001_word0005 Lyon,
0.960306 0.496031  # region0002_line0010_word0007 Lyon
0.911557 0.496306  # region0002_line0001_word0004 Rousseaus
0.967390 0.496923  # region0000_line0011_word0003 1792
0.929831 0.497394  # region0002_line0004_word0003 im
0.960453 0.498542  # region0002_line0017_word0006 Lyon
0.910209 0.499822  # region0002_line0018_word0002 Instinktiv
...
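
To make the comparison less dependent on eyeballing, the listings above could be checked with a small script. The following is only a sketch: it assumes each data line has the layout shown above (two scores, a `#`, the word ID, the word text), and it assumes the first column is the OCR confidence and the second the keraslm rating, as the `OCR-D-OCR` / `OCR-D-KERAS` header suggests. The names `Rating`, `parse_ratings` and `hyphenated` are made up for illustration and are not part of keraslm.

```python
import re
from typing import NamedTuple


class Rating(NamedTuple):
    ocr_conf: float   # assumed: first column (OCR-D-OCR)
    lm_score: float   # assumed: second column (OCR-D-KERAS)
    word_id: str
    text: str


# Matches data lines such as:
# "0.927003 0.468318  # region0002_line0029_word0006 Rä-"
LINE_RE = re.compile(r"^(\d+\.\d+)\s+(\d+\.\d+)\s+#\s+(\S+)\s+(\S+)\s*$")


def parse_ratings(listing: str) -> list[Rating]:
    """Parse data lines; header lines starting with '#' and '...' are skipped."""
    rows = []
    for line in listing.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append(Rating(float(m[1]), float(m[2]), m[3], m[4]))
    return rows


def hyphenated(rows: list[Rating]) -> list[Rating]:
    """Return words that end in a hyphen (likely line-end fragments)."""
    return [r for r in rows if len(r.text) > 1 and r.text.endswith("-")]


# A few lines copied from the first listing above.
sample = """\
# OCR-D-OCR OCR-D-KERAS
0.693236 0.410939  # region0002_line0021_word0003 daf3
0.927003 0.468318  # region0002_line0029_word0006 Rä-
0.932888 0.480686  # region0002_line0021_word0002 Lyon,
0.931271 0.489822  # region0002_line0000_word0005 pas-
...
"""

rows = parse_ratings(sample)
hyph = hyphenated(rows)
print(f"{len(hyph)} of {len(rows)} words end in a hyphen:", [r.text for r in hyph])
```

Applied to the full first listing, this kind of check would flag 4 of the 17 lowest-rated words shown (`Rä-`, `pas-`, `sozia-`, `Kon-`); these are exactly the entries missing from the second listing.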
