Description
Bug
In Docling v2, the processor cannot consistently extract text from a table nested within a table. Here is the detailed analysis:
When converting PDFs to Markdown with Docling, some table cell text is missing from the final output. For example, the line "Minimum 9 MOB on us performance." is present in the extracted data but missing from the Markdown and intermediate outputs. This appears to be caused by the TableFormer model (the table structure model) dropping or merging certain input cell texts.
Steps to Reproduce
- Run the Docling pipeline on Table.pdf.
- Observe that the text is present in the initial extraction and preprocessing stages.
- Check the debug logs for the table structure model:
  - The text is present in the input tokens to the model.
  - The text is missing from the model’s output table cells.
- The final Markdown output does not contain the missing line.
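The end-to-end check in the steps above can be sketched roughly as follows. This is a minimal sketch, not the script used for this report: `find_missing_lines` and `convert_to_markdown` are hypothetical helper names, and the Docling calls assume the standard `DocumentConverter` usage with default pipeline options.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so spacing or case does not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_missing_lines(expected_lines, markdown_text):
    """Return the expected lines that do not occur anywhere in the Markdown output."""
    haystack = normalize(markdown_text)
    return [line for line in expected_lines if normalize(line) not in haystack]


def convert_to_markdown(pdf_path: str) -> str:
    """Convert a PDF with Docling's default pipeline and return the Markdown export."""
    # Imported lazily so the comparison helper works without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(pdf_path)
    return result.document.export_to_markdown()


# Usage (requires Docling and the PDF from this report):
#   markdown = convert_to_markdown("Table.pdf")
#   print(find_missing_lines(["Minimum 9 MOB on us performance."], markdown))
```

A non-empty result lists the lines that were lost somewhere between extraction and the Markdown export.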
Debug Evidence
- Input tokens to Table Structure Model: "Minimum 9 MOB on us performance." is present.
- Output table cells from TableFormer: the line is missing.
- No code logic drops the line: the drop occurs inside the TableFormer model’s prediction step (multi_table_predict).
What I’ve Tried
- Verified that the text is present in all early pipeline stages.
- Added debug logging to compare input tokens and output table cells.
- Confirmed that the loss occurs inside the TableFormer model, not in Docling’s own code.
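The token-versus-cell comparison mentioned above can be sketched like this. It is an illustration under assumptions, not the actual debug code: `dropped_tokens` is a hypothetical helper, and both inputs are assumed to be lists of dicts carrying a `"text"` key, which may not match the real structures passed to and returned by the model.

```python
import re


def _norm(text: str) -> str:
    """Collapse whitespace so spacing differences do not count as a mismatch."""
    return re.sub(r"\s+", " ", text).strip()


def dropped_tokens(input_tokens, output_cells):
    """Return input token texts that are not contained in any output cell text.

    Assumed shapes (hypothetical): each token and each cell is a dict with a
    "text" key. Empty or whitespace-only tokens are ignored.
    """
    cell_texts = [_norm(cell.get("text", "")) for cell in output_cells]
    dropped = []
    for token in input_tokens:
        text = _norm(token.get("text", ""))
        if text and not any(text in cell_text for cell_text in cell_texts):
            dropped.append(text)
    return dropped


# Placeholder data for illustration only; "Eligibility" is made up.
tokens = [
    {"text": "Eligibility"},
    {"text": "Minimum 9 MOB on us performance."},
]
cells = [{"text": "Eligibility"}]
print(dropped_tokens(tokens, cells))
```

Matching per cell rather than against one concatenated string avoids false positives that span cell boundaries.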
Expected Behavior
All input cell texts, especially short or bullet-point lines, should be preserved in the output table cells unless there is a clear, documented reason for merging or dropping them.
Actual Behavior
The TableFormer model drops input cell texts, resulting in missing lines in the final output.
Request
- Is there a way to configure TableFormer to be less aggressive in merging/dropping lines?
- Can the model be updated or retrained to preserve all input cell texts?
- Any recommended workarounds for ensuring all extracted lines are present in the output?
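One possible starting point for the configuration question, as a sketch only: Docling's PDF pipeline options expose a TableFormer mode and a cell-matching switch. `build_converter` is a hypothetical wrapper name, and whether either setting changes the dropped line in this report has not been verified.

```python
def build_converter(do_cell_matching: bool = False, accurate: bool = True):
    """Build a Docling converter with explicit table-structure options."""
    # Imported lazily so that defining this function does not require Docling.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions(do_table_structure=True)
    # False: use the text cells predicted by the table structure model instead
    # of mapping the predicted structure back onto the PDF's own cells.
    options.table_structure_options.do_cell_matching = do_cell_matching
    options.table_structure_options.mode = (
        TableFormerMode.ACCURATE if accurate else TableFormerMode.FAST
    )
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


# Usage (requires Docling and the PDF from this report):
#   result = build_converter().convert("Table.pdf")
#   print(result.document.export_to_markdown())
```

Comparing the Markdown output across the four combinations of the two flags would show whether the loss depends on cell matching or on the model variant.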
Thank you for your help!
Let me know if you need any debug logs.
Docling version
v2
...
Python version
3.13.5
...
Please help us.