Feature Request: Add SPLIT_PAGE_CONTENT_ONLY output mode to HTML export
I'm requesting an enhancement to the HTML export functionality in `docling_core`. Currently, the split page view (`SPLIT_PAGE`) outputs each page as a table row containing an image and the page content. For some use cases, the table structure and images are unnecessary, and it would be useful to have page-wise content separated into `<div class='page'>` blocks, each with a `data-page` attribute for page numbering, but without images or table markup.
Proposed solution:
- Add a new output style (`SPLIT_PAGE_CONTENT_ONLY`) to the `HTMLOutputStyle` enum.
- In the `HTMLDocSerializer.serialize_doc` method, implement a block that:
  - Splits the document per page (like the current split logic),
  - Outputs each page as `<div class='page' data-page='N'>...</div>` (where `N` is the page number),
  - Does not include images or table structure.
This would allow users to directly export clean, page-wise HTML content for further web embedding or processing.
Suggested code changes:
- Extend the `HTMLOutputStyle` enum.
- Add corresponding logic in the serializer as outlined above.
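
To make the request concrete, here is a rough, standalone sketch of the two changes. It does not import `docling_core`: the enum below is only a stand-in for `HTMLOutputStyle`, the string values are my assumption, and `render_pages_content_only` with its `pages` mapping is a hypothetical helper, not an existing function. The real change would live inside `HTMLDocSerializer.serialize_doc` and reuse the existing per-page split.

```python
from enum import Enum


class HTMLOutputStyle(str, Enum):
    """Stand-in for the docling_core enum (values assumed, other members omitted)."""

    SPLIT_PAGE = "split_page"
    SPLIT_PAGE_CONTENT_ONLY = "split_page_content_only"  # proposed addition


def render_pages_content_only(pages: dict[int, str]) -> str:
    """Wrap each page's already-serialized HTML in a <div class='page'> block.

    `pages` maps a page number to that page's HTML content. This input shape
    is hypothetical; the serializer would build it from its per-page split.
    No <table> wrapper and no page image are emitted.
    """
    blocks = []
    for page_no in sorted(pages):  # emit pages in ascending page order
        blocks.append(
            f"<div class='page' data-page='{page_no}'>\n{pages[page_no]}\n</div>"
        )
    return "\n".join(blocks)
```

With something like this, `render_pages_content_only({1: "<p>Intro</p>", 2: "<p>Details</p>"})` would give two `<div class='page'>` blocks with `data-page='1'` and `data-page='2'`, which is the output shape I have in mind for the new style.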
Thank you!