Some feedback and requests #64

gabriel-wainmann · 2024-09-09T00:14:35Z


Hi Docling team. This is nothing less than wonderful. Thank you.

Running this on a two-page scanned pdf, I get these errors:

1. The first and last lines of the entire PDF are not OCRed at all.
2. Rows merged across columns in a table get repeated for each column.

Please, can you add other OCR options to choose from, such as easyocr and tesseract?

dolfim-ibm · 2024-09-09T06:11:50Z

Maintainer

Hi @gabriel-wainmann, I think that what you see is the (current) expected behavior. For both your points, I assume you are referring to the markdown output, correct?

1. When Docling detects page headers and footers, it removes them from the Markdown output because they are not part of the "natural text flow". The content should still be present in the JSON output.
2. Merged column headers (headers spanning multiple columns) have no native representation in Markdown, which is why we deliberately replicate the header in every column it belongs to. The output is a regular 2D grid that is easy to iterate over.
   - The JSON format, on the other hand, records which cells are merged together.
   - We are working on an example that exports tables to HTML and pandas DataFrames, where these relationships will be represented correctly.
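
To illustrate the second point, here is a minimal sketch of how a header spanning several columns can be replicated into a regular 2D grid before it is written out as Markdown. The `Cell` record, its field names, and the `to_grid` and `to_markdown` helpers are hypothetical names chosen for this example; they are not Docling's actual JSON schema or API.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical cell record; NOT Docling's real schema."""
    text: str
    row: int
    col: int
    col_span: int = 1  # number of columns this cell covers


def to_grid(cells, n_rows, n_cols):
    """Expand spanning cells so every grid position holds a string."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        # Copy the cell text into every column it spans.
        for c in range(cell.col, cell.col + cell.col_span):
            grid[cell.row][c] = cell.text
    return grid


def to_markdown(grid):
    """Render the grid as a pipe table, treating row 0 as the header."""
    header, *body = grid
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


cells = [
    Cell("Revenue", row=0, col=0, col_span=2),  # header merged over 2 columns
    Cell("Notes", row=0, col=2),
    Cell("2023", row=1, col=0),
    Cell("2024", row=1, col=1),
    Cell("audited", row=1, col=2),
]
grid = to_grid(cells, n_rows=2, n_cols=3)
print(to_markdown(grid))
```

In this sketch the "Revenue" header appears twice in the Markdown header row, which mirrors the repetition reported above, while the span information stays available in the source records.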

2 replies

aeamaea Sep 10, 2024

> We are working on an example which exports the tables in HTML and Pandas Dataframes, where these relationships will be represented correctly.

@dolfim-ibm That would be amazing! I would love to be able to export tables with each table identifiable by its header (or something similar), so I can iterate over them in the JSON or in the intermediate DataFrame format.

dolfim-ibm Sep 18, 2024
Maintainer

@gabriel-wainmann @aeamaea This was just finalized. You can find an example in export_tables.py.
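
As an independent illustration of why HTML can represent merged cells where Markdown cannot, the sketch below renders a spanning header with the standard `colspan` attribute. It is not the content of export_tables.py and does not use Docling; the tuple layout `(text, row, col, col_span)` and the `to_html` helper are assumptions made for this example.

```python
from html import escape


def to_html(cells):
    """Render (text, row, col, col_span) tuples as an HTML table.

    The tuple layout is hypothetical and chosen only for this sketch.
    """
    rows = {}
    for text, row, col, col_span in cells:
        rows.setdefault(row, []).append((col, text, col_span))

    parts = ["<table>"]
    for row in sorted(rows):
        parts.append("  <tr>")
        # Emit cells left to right; a merged cell is written once with colspan.
        for col, text, col_span in sorted(rows[row]):
            span = f' colspan="{col_span}"' if col_span > 1 else ""
            parts.append(f"    <td{span}>{escape(text)}</td>")
        parts.append("  </tr>")
    parts.append("</table>")
    return "\n".join(parts)


table_html = to_html([
    ("Revenue", 0, 0, 2),  # header merged over two columns
    ("Notes", 0, 2, 1),
    ("2023", 1, 0, 1),
    ("2024", 1, 1, 1),
    ("audited", 1, 2, 1),
])
print(table_html)
```

Here the merged header is written once rather than replicated, so the relationship between the header and its columns is preserved in the output.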

gabriel-wainmann · 2024-09-09T06:13:50Z

Author

This is wonderful. Thank you so much
