bug: missing text from parsed pdf

I’ve encountered an issue where some text is missing after parsing certain PDFs. In the attached example, the text USD 700 disappears during the parsing process.

In the following code, `pages` still contains all the content:

```
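from bs4 import BeautifulSoup

# Parse the HTML produced by Tika
soup = BeautifulSoup(str(tika_html_doc), "html.parser")
print("Soup", soup)

# Look for the document title in the <meta> tags
meta_tags = soup.find_all("meta")
title = None
for tag in meta_tags:
    # .get() avoids a KeyError on <meta> tags without a "name" attribute
    if tag.get("name", "").endswith(":title"):
        title = tag["content"]
        break

# Collect the per-page <div class="page"> elements; "USD 700" is still present here
pages = soup.find_all("div", class_=lambda x: x in ["page"])
print("Pages", pages)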
```

However, the blocks inside the parsed document are missing some content:
```
parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
print("Parsed_doc", parsed_doc.blocks)
```

I wasn’t able to debug it completely, but I believe the problem lies within the parse function. I’m not certain whether this is a bug or whether BeautifulSoup is misinterpreting `USD 700` as a header when it clearly isn’t one. The main problem is that, in this example, the dropped text was a fairly important title and did not resemble a page header at all.
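To narrow down where the text is dropped, it may help to diff the text of the pages against the text of the parsed blocks. The following is only a minimal sketch that works on plain strings: `normalize`, `find_missing_lines`, and the sample data are hypothetical names and values I made up for illustration, not part of the library.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout differences don't cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip()


def find_missing_lines(page_texts, block_texts):
    """Return the page lines whose normalized text appears in none of the blocks."""
    blocks_joined = normalize(" ".join(block_texts))
    missing = []
    for page_text in page_texts:
        for line in page_text.splitlines():
            line = normalize(line)
            if line and line not in blocks_joined:
                missing.append(line)
    return missing


# Hypothetical data standing in for the real page text and block text.
page_texts = ["Invoice summary\nUSD 700\nPayment due in 30 days"]
block_texts = ["Invoice summary", "Payment due in 30 days"]

print(find_missing_lines(page_texts, block_texts))  # ['USD 700']
```

With the real objects, `page_texts` could probably be built with `page.get_text("\n")` for each element of `pages`. How to extract the text from `parsed_doc.blocks` depends on the block structure, which I haven't confirmed.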

Any help is appreciated.

### Example pdf:
Here is the PDF that has been causing me problems. It's not complete for privacy reasons, but it's the smallest example I found that still causes the problem. If I edit it slightly, for example by adding text next to the `USD`, the problem no longer occurs.

[problematic-pdf.pdf](https://github.com/user-attachments/files/16197739/problematic-pdf.pdf)



