Document ID doesn't update upon metadata update #8692

Closed
1 task done
wochinge opened this issue Jan 9, 2025 · 2 comments
Labels
P3 Low priority, leave it in the backlog

Comments

@wochinge
Contributor

wochinge commented Jan 9, 2025

Describe the bug
If you assign the meta field of a Document after initialization, the id of the document is not updated.
This happens, for example, in the PyPDFConverter.

Documents that share an ID despite having different metadata cause problems with document stores that use the duplicate policy OVERWRITE: all of them are treated as the same document and overwrite each other.
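
A minimal, self-contained sketch of the failure mode described above. `SketchDocument`, `compute_id`, and `write_overwrite` are hypothetical stand-ins, not Haystack code; the sketch only assumes that the ID is a hash of the fields computed once at construction time, and that an OVERWRITE policy replaces any stored document with the same ID.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict


def compute_id(content: str, meta: Dict[str, Any]) -> str:
    """Hash content and metadata into a hex digest (assumed scheme)."""
    payload = f"{content}{sorted(meta.items())}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a document class, not Haystack's implementation."""

    content: str = ""
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = ""

    def __post_init__(self) -> None:
        # The ID is derived exactly once, from the values present at construction.
        if not self.id:
            self.id = compute_id(self.content, self.meta)


def write_overwrite(store: Dict[str, SketchDocument], doc: SketchDocument) -> None:
    """Toy OVERWRITE policy: a document with an existing ID replaces the stored one."""
    store[doc.id] = doc


# Two pages with identical content; metadata is assigned only after construction.
page_1 = SketchDocument(content="same text")
page_2 = SketchDocument(content="same text")
page_1.meta = {"page": 1}
page_2.meta = {"page": 2}

store: Dict[str, SketchDocument] = {}
write_overwrite(store, page_1)
write_overwrite(store, page_2)
```

Under these assumptions both pages keep the ID computed from the empty metadata, so the store ends up holding only the second page. Passing the metadata to the constructor instead would yield distinct IDs in this sketch.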

Error message
Error that was thrown (if available)

Expected behavior
The ID should be updated when the metadata changes. The same applies to the other properties.

Additional context
Ideally, we find a solution in which the ID is updated automatically but can also be overridden manually.
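
One possible shape for such a solution, as a rough sketch only: derive the ID on access from the current field values unless an ID was set explicitly. `AutoIdDocument` and its `_manual_id` attribute are hypothetical names, and the hashing scheme is an assumption, not how Haystack computes IDs.

```python
import hashlib
from typing import Any, Dict, Optional


class AutoIdDocument:
    """Hypothetical design sketch: ID derived on access unless explicitly set."""

    def __init__(
        self,
        content: str = "",
        meta: Optional[Dict[str, Any]] = None,
        id: Optional[str] = None,
    ) -> None:
        self.content = content
        self.meta = dict(meta or {})
        # None means "derive the ID from the current fields".
        self._manual_id = id

    @property
    def id(self) -> str:
        if self._manual_id is not None:
            return self._manual_id
        payload = f"{self.content}{sorted(self.meta.items())}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    @id.setter
    def id(self, value: str) -> None:
        # A manual override takes precedence over the derived value from now on.
        self._manual_id = value


doc = AutoIdDocument()
old_id = doc.id
doc.meta = {"test": 10}
new_id = doc.id  # derived from the updated metadata

doc.id = "my-fixed-id"  # manual override
```

A trade-off of this design is that a derived ID changes whenever a field changes, so a document that was already written to a store would no longer match its stored ID.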

To Reproduce

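from haystack import Document


def test_set_meta_afterwards():
    doc = Document()
    old_id = doc.id
    # Assign metadata after initialization instead of passing it to the constructor
    doc.meta = {"test": 10}
    assert doc.meta == {"test": 10}
    # Fails at the time of this report: the id is still the one computed at init
    assert doc.id != old_id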

FAQ Check

System:

  • OS:
  • GPU/CPU:
  • Haystack version (commit or version number):
  • DocumentStore:
  • Reader:
  • Retriever:
@julian-risch julian-risch added the P0 Highest priority, add to the current sprint label Jan 9, 2025
@julian-risch julian-risch self-assigned this Jan 9, 2025
@julian-risch julian-risch added this to the 2.9.0 milestone Jan 9, 2025
@julian-risch julian-risch removed this from the 2.9.0 milestone Jan 13, 2025
@julian-risch julian-risch added P2 Medium priority, add to the next sprint if no P1 available and removed P0 Highest priority, add to the current sprint labels Jan 13, 2025
@julian-risch julian-risch removed their assignment Jan 13, 2025
@julian-risch
Member

With #8698 and #8708 merged, the immediate issue has been addressed. Before closing this issue, we should check whether it makes sense to emit a warning when a document's attribute is updated, saying that the id is not re-created. For content and metadata, at least, I think it makes sense. I'm not sure about embedding.
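
A rough sketch of what such a warning could look like, assuming a plain Python class that intercepts attribute assignment. `WarningDocument`, `_ID_RELEVANT`, and the warning text are hypothetical and do not reflect what Haystack actually does; `embedding` is deliberately left out of the watched fields, since that case is open above.

```python
import hashlib
import warnings
from typing import Any, Dict, Optional


class WarningDocument:
    """Hypothetical sketch, not the Haystack implementation."""

    # Fields assumed to contribute to the ID.
    _ID_RELEVANT = ("content", "meta")

    def __init__(self, content: str = "", meta: Optional[Dict[str, Any]] = None) -> None:
        self.content = content
        self.meta = dict(meta or {})
        payload = f"{self.content}{sorted(self.meta.items())}"
        self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # From here on, assignments to ID-relevant fields trigger a warning.
        self._initialized = True

    def __setattr__(self, name: str, value: Any) -> None:
        if getattr(self, "_initialized", False) and name in self._ID_RELEVANT:
            warnings.warn(
                f"'{name}' was changed after initialization; 'id' is not re-created.",
                UserWarning,
                stacklevel=2,
            )
        super().__setattr__(name, value)


doc = WarningDocument(content="hello")
id_before = doc.id

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    doc.meta = {"test": 10}  # warns, and leaves doc.id untouched
```

Note that an in-place mutation such as `doc.meta["test"] = 10` does not go through `__setattr__`, so this approach would not catch it.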

@julian-risch julian-risch added P3 Low priority, leave it in the backlog and removed P2 Medium priority, add to the next sprint if no P1 available labels Jan 20, 2025
@sjrl
Contributor

sjrl commented Feb 13, 2025

Closing this one to track @julian-risch's request in a new issue, #8856.

@sjrl sjrl closed this as completed Feb 13, 2025
3 participants