PDFMinerToDocument updates Document's meta field after initializing it #8701

Closed
1 task done
julian-risch opened this issue Jan 10, 2025 · 0 comments · Fixed by #8708
Labels: P1 (High priority, add to the next sprint)

julian-risch commented Jan 10, 2025

Describe the bug
PDFMinerToDocument updates the metadata of a Document after initializing it instead of setting the metadata at initialization:

document.meta = merged_metadata

As a result, two documents with the same content but different metadata can be assigned the same document id. When these documents are written to a DocumentStore, they are handled as duplicates although they aren't.
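A minimal sketch of why this collision happens, assuming the id is hashed once from content and meta when the object is constructed. `SketchDocument` is a hypothetical stand-in written for illustration, not Haystack's `Document`, and the file names are made up:

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in (not Haystack's Document): the id is hashed once, at construction."""

    content: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = field(default="")

    def __post_init__(self):
        # The id is derived from content and meta as they are right now.
        # Later changes to self.meta do not recompute it.
        if not self.id:
            payload = f"{self.content}{sorted(self.meta.items())}"
            self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Meta assigned after construction: both ids were computed from empty meta.
late_a = SketchDocument(content="")
late_a.meta = {"file_path": "a.pdf"}
late_b = SketchDocument(content="")
late_b.meta = {"file_path": "b.pdf"}

# Meta passed at construction: it takes part in the hash.
early_a = SketchDocument(content="", meta={"file_path": "a.pdf"})
early_b = SketchDocument(content="", meta={"file_path": "b.pdf"})

print(late_a.id == late_b.id)    # same id despite different meta
print(early_a.id == early_b.id)  # ids differ
```

Under that assumption, the two late-assigned documents share an id even though their metadata differs, which is the duplicate handling described above.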

Expected behavior
Set the metadata of the Document in PDFMinerToDocument when the Document is initialized instead of updating the metadata later.
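The expected pattern could look roughly like the sketch below: merge the metadata first, then construct the document with it. The `convert` helper, its parameters, and the `SketchDocument` class are hypothetical and only illustrate the ordering; they are not the converter's actual code:

```python
import hashlib
from typing import Any, Dict, Optional


class SketchDocument:
    """Hypothetical stand-in (not Haystack's Document) whose id is fixed at construction."""

    def __init__(self, content: str = "", meta: Optional[Dict[str, Any]] = None):
        self.content = content
        self.meta = dict(meta or {})
        payload = f"{self.content}{sorted(self.meta.items())}"
        self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


def convert(
    text: str,
    source_meta: Dict[str, Any],
    user_meta: Optional[Dict[str, Any]] = None,
) -> SketchDocument:
    # Merge first, construct second, so the merged metadata is part of the id.
    merged_metadata = {**source_meta, **(user_meta or {})}
    return SketchDocument(content=text, meta=merged_metadata)


doc = convert("", {"file_path": "empty.pdf"})
bare = SketchDocument(content="")
print(doc.id != bare.id)
```

With this ordering, a document with empty content and a file path in its meta no longer shares an id with a bare empty document.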

Additional context
A similar issue was fixed for PyPDFToDocument in #8698

To Reproduce
Use PDFMinerToDocument with an empty PDF, which returns a Document with empty content and the file path stored in meta. Then compare the id of that document to the id of a Document with empty content and no metadata. The IDs are currently the same but should differ, as in this test:

def test_run_empty_document(self, caplog, test_files_path):

FAQ Check
