bug: missing text from parsed pdf #75

Open
fede-bello opened this issue Jul 12, 2024 · 1 comment

fede-bello commented Jul 12, 2024

I’ve encountered an issue where some text is missing after parsing certain PDFs. In the attached example, the text USD 700 disappears during the parsing process.

In the following code, the pages still contain all the content:

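from bs4 import BeautifulSoup

# tika_html_doc: the HTML output from Tika for the PDF (defined elsewhere)
soup = BeautifulSoup(str(tika_html_doc), "html.parser")
print("Soup", soup)
meta_tags = soup.find_all("meta")
title = None
for tag in meta_tags:
    # .get() avoids a KeyError on <meta> tags that have no name attribute
    if tag.get("name", "").endswith(":title"):
        title = tag["content"]
        break
# Each page is a <div class="page"> element
pages = soup.find_all("div", class_=lambda x: x in ["page"])
print("Pages", pages)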

However, the blocks inside the parsed document are missing some content:

parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
print("Parsed_doc", parsed_doc.blocks)

I wasn’t able to debug it completely, but I believe the problem lies within the parse function. I’m not certain whether this is a bug or whether BeautifulSoup is misinterpreting USD 700 as a header when it clearly isn’t one. The main problem is that in this example the dropped text is a fairly important title and does not resemble a header at all.
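
One way to narrow this down is to check mechanically at which stage the string disappears. The following is a minimal sketch, not part of the library: the helper names `normalize` and `locate_loss` are hypothetical, the `block_text` key is an assumption about how each block stores its text, and the sample data is made up for illustration.

```python
import re


def normalize(text):
    """Collapse runs of whitespace so 'USD\n 700' and 'USD 700' compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def locate_loss(needle, page_texts, blocks, text_key="block_text"):
    """Report the stage at which `needle` disappears.

    page_texts: one string per page, e.g. [p.get_text() for p in pages]
    blocks:     list of dicts, e.g. parsed_doc.blocks
    text_key:   ASSUMED name of the dict key holding each block's text
    """
    needle = normalize(needle)
    in_pages = any(needle in normalize(t) for t in page_texts)
    in_blocks = any(
        needle in normalize(str(b.get(text_key, ""))) for b in blocks
    )
    if not in_pages:
        return "missing before parsing"
    if not in_blocks:
        return "lost during parsing"
    return "present in blocks"


# Made-up data that mimics the reported symptom
page_texts = ["Invoice\nUSD  700\nTotal due"]
blocks = [{"block_text": "Invoice"}, {"block_text": "Total due"}]
print(locate_loss("USD 700", page_texts, blocks))
```

With the real objects, `page_texts` could be built from the `pages` list with `get_text()` and `blocks` taken from `parsed_doc.blocks`, adjusting `text_key` if the blocks use a different field name.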

Any help is appreciated.

Example pdf:

Here is the PDF that has been causing me problems. It's not complete for privacy reasons, but it's the minimal example I found that causes this problem. If I edit it slightly, for example by adding text next to the USD, the problem no longer occurs.

problematic-pdf.pdf

@jonhilgart22

I'm also seeing text missing from the returned output. Did you have any luck debugging?
