Repeated row names #238

Open
xinyaohuu opened this issue Nov 4, 2024 · 1 comment

xinyaohuu commented Nov 4, 2024

When I converted the tables to Markdown, I noticed repeated names. This happens in many tables, not just one. Do you have any suggestions?
Image: [Screenshot 2024-11-04 at 4 50 22 PM]
Markdown output: [Screenshot 2024-11-04 at 4 51 08 PM]

cau-git (Contributor) commented Nov 5, 2024

@xinyaohuu Thanks for the question. To clarify, in Markdown this is actually expected behaviour, because the Employee Name / Address cell is detected as a multi-column cell with a span of 3. Markdown has no native way to represent cell spans, so we resort to repeating the same value three times in this case. The repetition is useful when you need to look up the column header for a given data point in the table, since every column then carries its own header text.
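To make the repetition concrete, here is a minimal sketch in plain Python of how a cell with a column span can be flattened into a Markdown row. It is not the library's implementation; the `Cell` class, `expand_row`, `to_markdown`, the "Salary" column, and the body values are all hypothetical and only serve to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical cell record: its text and how many columns it spans."""
    text: str
    col_span: int = 1


def expand_row(cells):
    """Repeat each cell's text once for every column it spans."""
    out = []
    for cell in cells:
        out.extend([cell.text] * cell.col_span)
    return out


def to_markdown(rows):
    """Render rows of Cell objects as a pipe table; the first row is the header."""
    grid = [expand_row(row) for row in rows]
    lines = ["| " + " | ".join(grid[0]) + " |"]
    lines.append("|" + "|".join(" --- " for _ in grid[0]) + "|")
    for row in grid[1:]:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


# Made-up sample data: one header cell spanning 3 columns, plus a plain one.
header = [Cell("Employee Name / Address", col_span=3), Cell("Salary")]
body = [Cell("Jane"), Cell("Doe"), Cell("1 Main St"), Cell("100")]
print(to_markdown([header, body]))
```

Because each of the three spanned columns gets its own copy of the header text, any data cell can be matched to its header by column index alone.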

If you want to see an accurate representation considering cell spans, you can use the JSON output or export the tables to HTML instead. The output will look like this.

[image: screenshot of the table rendered from the HTML export]
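For comparison, HTML can carry the span directly through the `colspan` attribute, so no repetition is needed. The following is a rough sketch in plain Python, not the library's exporter; `row_to_html` and the sample header (including the "Salary" column) are hypothetical.

```python
from html import escape


def row_to_html(cells, tag="td"):
    """Render (text, col_span) pairs as one HTML table row.

    A span greater than 1 becomes a colspan attribute instead of
    repeated cells.
    """
    parts = []
    for text, col_span in cells:
        attr = f' colspan="{col_span}"' if col_span > 1 else ""
        parts.append(f"<{tag}{attr}>{escape(text)}</{tag}>")
    return "<tr>" + "".join(parts) + "</tr>"


# Made-up header row: one cell spanning 3 columns, one regular cell.
header = [("Employee Name / Address", 3), ("Salary", 1)]
print(row_to_html(header, tag="th"))
```

The spanned header appears once with `colspan="3"`, which is why the HTML view matches the original table layout more closely than the Markdown one.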

The minimal code to get the HTML tables is:

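# Assumption: DocumentConverter is Docling's converter; adjust the import if your setup differs.
from docling.document_converter import DocumentConverter

source = "your_table_doc.pdf"  # path to the document containing the tables
converter = DocumentConverter()
result = converter.convert(source)

# Print each detected table as HTML, which preserves cell spans
for table in result.document.tables:
    print("===============")
    print(table.export_to_html())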

The screenshot above has this HTML source: sample_table.html.zip
