HTML <p> Tags in Table Cell Merged Incorrectly in Markdown #1927

Open
@MahmoudAtef4499

Bug Description

When converting HTML tables to Markdown, if a table cell contains multiple <p> tags, the output concatenates their text with no separator, so distinct values run together into a single incorrect value.

Example:

This HTML cell:

<td><p>3</p><p>1</p></td>

is currently converted to:

| 31 |

instead of preserving the structure like:

| 3<br>1 |

or:

| 3  
  1 |

Expected Behavior

  • The Markdown converter should preserve the semantic line breaks or paragraph separations within a table cell.
  • Multiple <p> tags should not be flattened into a single string with no delimiter.
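A minimal sketch of the expected behavior, using only Python's standard library html.parser. It is not Docling's implementation; the names CellParagraphJoiner and cells_to_markdown_rows are hypothetical and exist only for illustration. It assumes simple, non-nested tables and joins the paragraphs of each cell with <br>:

```python
from html.parser import HTMLParser


class CellParagraphJoiner(HTMLParser):
    """Hypothetical helper: collect each table cell, joining its <p> blocks with <br>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell strings
        self._row = None     # cells of the row currently being parsed
        self._blocks = None  # paragraph texts of the cell currently being parsed
        self._buf = []       # text fragments of the current paragraph

    def _flush(self):
        # Close the current paragraph; keep it only if it contains text.
        text = "".join(self._buf).strip()
        self._buf = []
        if text and self._blocks is not None:
            self._blocks.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            if self._row is None:  # tolerate a bare <td> outside any <tr>
                self._row = []
            self._blocks = []
            self._buf = []
        elif tag in ("p", "br"):
            self._flush()  # a new paragraph or line break starts a new block

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag in ("td", "th") and self._blocks is not None:
            self._flush()
            # Escape pipes so cell text cannot break the Markdown table row.
            cell = "<br>".join(self._blocks).replace("|", "\\|")
            self._row.append(cell)
            self._blocks = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._blocks is not None:  # ignore text outside table cells
            self._buf.append(data)

    def close(self):
        super().close()
        if self._row:  # emit a row that was never closed by </tr>
            self.rows.append(self._row)
            self._row = None


def cells_to_markdown_rows(html):
    """Return one Markdown table row string per parsed HTML row."""
    parser = CellParagraphJoiner()
    parser.feed(html)
    parser.close()
    return ["| " + " | ".join(row) + " |" for row in parser.rows]


print(cells_to_markdown_rows("<td><p>3</p><p>1</p></td>"))
```

With the cell from the example above, this yields the row | 3<br>1 | rather than | 31 |. Using <br> as the delimiter is one option; it keeps the cell on a single line, which pipe tables require.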

Current Behavior

  • The converter merges multiple <p> tags into one line, resulting in values like 31 instead of maintaining their structure or indicating separation.
  • This misrepresents the actual data from the HTML source.

Environment

  • Python: 3.12
  • Docling version: Latest

I have attached an HTML file that reproduces the issue:

174627142405997939927c_page_111.zip

Thank you very much in advance for your time and support.

Labels: bug