bug: missing text from parsed pdf #75

Open
@fede-bello

I’ve encountered an issue where some text is missing after parsing certain PDFs. In the attached example, the text USD 700 disappears during the parsing process.

In the following code, the pages still contain all the content:

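from bs4 import BeautifulSoup

soup = BeautifulSoup(str(tika_html_doc), "html.parser")
print("Soup", soup)
meta_tags = soup.find_all("meta")
title = None
for tag in meta_tags:
    # .get() avoids a KeyError on <meta> tags that have no "name" attribute
    if tag.get("name", "").endswith(":title"):
        title = tag["content"]
        break
# Each page in the Tika output is a <div class="page">
pages = soup.find_all("div", class_=lambda x: x in ["page"])
print("Pages", pages)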

However, the blocks inside the parsed document are missing some content:

parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
print("Parsed_doc", parsed_doc.blocks)

I wasn’t able to debug it completely, but I believe the problem lies within the parse function. I’m not certain whether this is a bug or whether BeautifulSoup is misinterpreting USD 700 as a header when it clearly isn’t one. The main problem is that in this example the dropped text was a fairly important title, and it did not resemble a header at all.
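To narrow down which text is lost between the two steps, the page text can be compared against the block text. The following is a minimal sketch, not part of the library: `find_missing_fragments` is a hypothetical helper, and the sample strings are stand-in data. In the real pipeline the inputs could be built with something like `[p.get_text("\n") for p in pages]` for the pages and, assuming each block is a dict with a `"block_text"` key, `[b["block_text"] for b in parsed_doc.blocks]` for the blocks.

```python
import re


def normalize(text):
    """Collapse runs of whitespace so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip()


def find_missing_fragments(page_texts, block_texts):
    """Return page text lines that do not appear anywhere in the block text.

    Hypothetical debugging helper: both arguments are plain lists of strings.
    """
    haystack = normalize(" ".join(block_texts))
    missing = []
    for page_text in page_texts:
        for line in page_text.splitlines():
            fragment = normalize(line)
            # Skip empty lines; report anything the blocks do not contain
            if fragment and fragment not in haystack:
                missing.append(fragment)
    return missing


# Stand-in data for illustration only (not taken from the attached PDF)
page_texts = ["Invoice summary\nUSD 700\nPayment due in 30 days"]
block_texts = ["Invoice summary", "Payment due in 30 days"]
print(find_missing_fragments(page_texts, block_texts))
```

Note that a line the parser splits across two blocks would also be reported here, so the output is a list of candidates to inspect rather than proof that text was dropped.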

Any help is appreciated

Example PDF:

Here is the PDF that has been causing me problems. It's not complete for privacy reasons, but it's the minimal example I found that causes this problem. If I edit it slightly, for example by adding text next to the USD, the problem no longer occurs.

problematic-pdf.pdf
