Add SPLIT_PAGE_CONTENT_ONLY output mode to HTML export #1941

Open
@bharathk01

Feature Request: Add SPLIT_PAGE_CONTENT_ONLY output mode to HTML export

I'd like to request an enhancement to the HTML export functionality in docling_core. Currently, the split page view (SPLIT_PAGE) renders each page as a table row containing the page image and the page content. Some use cases need neither the table structure nor the images. For those, it would be useful to emit each page's content in its own <div class='page'> block, carrying a data-page attribute with the page number and no images or table markup.

Proposed solution:

  • Add a new output style (SPLIT_PAGE_CONTENT_ONLY) to the HTMLOutputStyle enum.
  • In the HTMLDocSerializer.serialize_doc method, implement a block that:
    • Splits the document per page (like current split logic),
    • Outputs each page as <div class='page' data-page='N'>...</div> (where N is the page number),
    • Does not include images or table structure.
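To make the requested output shape concrete, here is a minimal sketch that does not depend on docling_core. It assumes the per-page content has already been serialized to HTML fragments and collected in a mapping from page number to fragment; the function name `render_pages_content_only` and that mapping are hypothetical, not part of the existing API.

```python
from typing import Mapping


def render_pages_content_only(pages: Mapping[int, str]) -> str:
    """Wrap each page's HTML fragment in <div class='page' data-page='N'>.

    `pages` maps a page number to that page's already-serialized HTML
    (an assumption made for this sketch). No images or table markup
    are emitted; pages are output in ascending page-number order.
    """
    blocks = []
    for page_no in sorted(pages):
        blocks.append(
            f"<div class='page' data-page='{page_no}'>\n"
            f"{pages[page_no]}\n"
            f"</div>"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(render_pages_content_only({1: "<p>First page</p>", 2: "<p>Second page</p>"}))
```

Each page ends up in its own top-level div, so downstream code can select a page with a plain attribute selector such as `div.page[data-page='2']`.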

This would allow users to directly export clean, page-wise HTML content for further web embedding or processing.

Suggested code changes:

  • Extend the HTMLOutputStyle enum.
  • Add corresponding logic in the serializer as outlined above.
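The two changes listed above might look roughly like the following. This is a stand-alone sketch, not the docling_core source: `HTMLOutputStyleSketch` is a local stand-in for the real enum, its string values are assumptions, `wrap_page` is a hypothetical helper, and the table-row markup in the SPLIT_PAGE branch is only illustrative of the current behaviour described in this issue.

```python
from enum import Enum


class HTMLOutputStyleSketch(str, Enum):
    """Local stand-in for HTMLOutputStyle; the values are assumed."""

    SINGLE_COLUMN = "single_column"
    SPLIT_PAGE = "split_page"
    SPLIT_PAGE_CONTENT_ONLY = "split_page_content_only"  # proposed new member


def wrap_page(
    style: HTMLOutputStyleSketch,
    page_no: int,
    content_html: str,
    image_html: str = "",
) -> str:
    """Hypothetical per-page wrapper that dispatches on the output style."""
    if style is HTMLOutputStyleSketch.SPLIT_PAGE_CONTENT_ONLY:
        # Proposed mode: content only, one div per page, image ignored.
        return f"<div class='page' data-page='{page_no}'>{content_html}</div>"
    if style is HTMLOutputStyleSketch.SPLIT_PAGE:
        # Illustrative only: image and content side by side in a table row.
        return f"<tr><td>{image_html}</td><td>{content_html}</td></tr>"
    # Single-column output: no per-page wrapper.
    return content_html
```

Keeping the new mode as a separate enum member, rather than a flag on SPLIT_PAGE, would leave the existing split view unchanged for current users.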

Thank you!

Metadata

    Labels

    enhancement: New feature or request
