OcrdPage: propagate inherited attributes and TextStyle #698

@bertsky

PAGE-XML features an implicit inheritance relation between various elements of the hierarchy:

  • Page/TextStyle → TextRegion*/TextStyle → TextLine/TextStyle → Word/TextStyle → Glyph/TextStyle
  • TextRegion*/@production → TextLine/@production → Word/@production → Glyph/@production
  • Page/@primaryScript → TextRegion*/@primaryScript → TextLine/@primaryScript → Word/@primaryScript → Glyph/@script
  • Page/@secondaryScript → TextRegion*/@secondaryScript → TextLine/@secondaryScript → Word/@secondaryScript → Glyph/@script
  • Page/@primaryLanguage → TextRegion*/@primaryLanguage → TextLine/@primaryLanguage → Word/@language
  • Page/@secondaryLanguage → TextRegion*/@secondaryLanguage → TextLine/@secondaryLanguage → Word/@language
  • Page/@readingDirection → TextRegion*/@readingDirection → TextLine/@readingDirection → Word/@readingDirection
  • Page/@textLineOrder → TextRegion*/@textLineOrder

These relations exist only in the documentation, so they cannot be implemented automatically in a generated DOM. But their semantics are important, and writing processors would be much easier if they were implemented.

For example, if I want to know whether the current segment belongs to a certain script, I currently have to:

  1. check the element type to determine which attribute name applies (@script or @primaryScript / @secondaryScript)
  2. check whether that attribute is set locally
  3. otherwise check the parent element's @primaryScript etc., and so on up the hierarchy

This is very hard to achieve with XPath (because disjunctions/unions are only possible on node sets, not on predicates), and with the DOM it requires a lot of code each time.
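To illustrate how much per-call code the three lookup steps take, here is a minimal sketch on a plain xml.etree.ElementTree tree rather than on the generated OcrdPage DOM. The helper names (local_name, effective_scripts), the lookup table and the sample document are hypothetical and only for illustration; real PAGE-XML is namespaced and has a PcGts root, which the tag handling below tolerates.

```python
import xml.etree.ElementTree as ET

# Step 1: which attribute names apply depends on the element type.
# Glyph only has @script; the other levels have @primaryScript/@secondaryScript.
SCRIPT_ATTRS = {"Glyph": ("script",)}
DEFAULT_ATTRS = ("primaryScript", "secondaryScript")


def local_name(elem):
    """Strip a '{namespace}' prefix from the tag, if present."""
    return elem.tag.rsplit("}", 1)[-1]


def effective_scripts(elem, parents):
    """Return the script values applying to elem (hypothetical helper)."""
    while elem is not None:
        names = SCRIPT_ATTRS.get(local_name(elem), DEFAULT_ATTRS)
        # Step 2: use the locally set value(s), if any.
        values = [elem.get(name) for name in names if elem.get(name) is not None]
        if values:
            return values
        # Step 3: otherwise continue with the parent element.
        elem = parents.get(elem)
    return []


doc = ET.fromstring(
    '<Page primaryScript="Latn">'
    '<TextRegion><TextLine primaryScript="Grek">'
    '<Word><Glyph/></Word></TextLine></TextRegion>'
    '<TextRegion><TextLine><Word><Glyph script="Cyrl"/></Word></TextLine></TextRegion>'
    '<TextRegion><TextLine><Word><Glyph/></Word></TextLine></TextRegion>'
    '</Page>'
)
# ElementTree has no parent pointers, so build a child -> parent map once.
parents = {child: parent for parent in doc.iter() for child in parent}
glyphs = list(doc.iter("Glyph"))

print(effective_scripts(glyphs[0], parents))  # ['Grek'], inherited from TextLine
print(effective_scripts(glyphs[1], parents))  # ['Cyrl'], set locally
print(effective_scripts(glyphs[2], parents))  # ['Latn'], inherited from Page
```

Every processor that needs the effective script, language or reading direction would have to repeat some variant of this walk.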

But we could facilitate this by simply propagating all inherited features during .build(), in a patched ocrd_page_generateds. We already have the user-methods mechanism for patching, and we could simply use buildChildren to propagate all of the above attributes (as a bottom-up post-hook), because attributes of parents are built before those of children.
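The propagation rule itself could look roughly like the following sketch. It is not the proposed buildChildren hook: it runs as a separate pass over a plain xml.etree.ElementTree tree, and the INHERIT table and propagate function are hypothetical names. The table covers only the primary script/language, production, reading direction and line order chains; how @secondaryScript and @secondaryLanguage should map onto the single Glyph/@script and Word/@language attributes is left open here.

```python
import xml.etree.ElementTree as ET

# Per child element type: child attribute -> parent attribute it inherits from.
# (Hypothetical table; a subset of the chains listed in this issue.)
INHERIT = {
    "TextRegion": {
        "primaryScript": "primaryScript",
        "primaryLanguage": "primaryLanguage",
        "readingDirection": "readingDirection",
        "textLineOrder": "textLineOrder",
    },
    "TextLine": {
        "production": "production",
        "primaryScript": "primaryScript",
        "primaryLanguage": "primaryLanguage",
        "readingDirection": "readingDirection",
    },
    "Word": {
        "production": "production",
        "primaryScript": "primaryScript",
        "language": "primaryLanguage",
        "readingDirection": "readingDirection",
    },
    "Glyph": {
        "production": "production",
        "script": "primaryScript",
    },
}


def propagate(parent):
    """Copy inherited attributes into children that do not set them locally."""
    for child in parent:
        rules = INHERIT.get(child.tag.rsplit("}", 1)[-1], {})
        for child_attr, parent_attr in rules.items():
            value = parent.get(parent_attr)
            if value is not None and child.get(child_attr) is None:
                child.set(child_attr, value)
        # The child is complete before its own children are visited.
        propagate(child)


page = ET.fromstring(
    '<Page primaryLanguage="German" readingDirection="left-to-right">'
    '<TextRegion><TextLine primaryScript="Grek" production="printed">'
    '<Word><Glyph/></Word>'
    '</TextLine></TextRegion>'
    '</Page>'
)
propagate(page)
word = page.find("./TextRegion/TextLine/Word")
glyph = word.find("Glyph")
print(word.attrib)
print(glyph.attrib)
```

After the pass, a processor can read the attribute directly from the element it is working on, without walking up the hierarchy.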

But for TextStyle, it's more complicated: on all hierarchy levels except the Page level, TextStyle sorts after the logical children and thus is only built after they are built. Also, one would need to unify style attributes between levels (we usually have True, False and None, so True/False from a parent replaces None in its children).
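The unification of style attributes could be sketched as follows, assuming a simplified stand-in for TextStyle with three tri-state flags. The Style class and inherit_style function are hypothetical and not the generated TextStyleType; the ordering problem during build is not addressed here.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Style:
    """Simplified stand-in for a TextStyle: None means 'not specified'."""
    bold: Optional[bool] = None
    italic: Optional[bool] = None
    underlined: Optional[bool] = None


def inherit_style(parent: Optional[Style], child: Optional[Style]) -> Optional[Style]:
    """Fill every None in the child's style with the parent's value."""
    if parent is None:
        return child
    if child is None:
        return parent
    merged = {}
    for field in fields(Style):
        own = getattr(child, field.name)
        # An explicit True or False on the child always wins over the parent.
        merged[field.name] = own if own is not None else getattr(parent, field.name)
    return Style(**merged)


line_style = Style(bold=True, italic=False)
word_style = Style(italic=True)
print(inherit_style(line_style, word_style))
# Style(bold=True, italic=True, underlined=None)
```

Note that an explicit False on the child must be kept, which is why the merge tests for None rather than for truthiness.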
