[Discussion] Purpose of hashed id for indexing #3868

Open
JJK801 opened this issue Jan 14, 2025 · 0 comments
JJK801 commented Jan 14, 2025

Hi @HenryHengZJ,

I'm running into trouble with the indexing feature when handling large files (8 to 10 MB CSV and PDF): some ids conflict (duplicate key in Postgres), probably because of the hashed ids (I'm not certain, but it looks likely):

const contentHash = this._hashStringToUUID(this.pageContent)

try {
    const metadataHash = this._hashNestedDictToUUID(this.metadata)
    this.contentHash = contentHash
    this.metadataHash = metadataHash
} catch (e) {
    throw new Error(`Failed to hash metadata: ${e}. Please use a dict that can be serialized using json.`)
}

this.hash_ = this._hashStringToUUID(this.contentHash + this.metadataHash)

if (!this.uid) {
    this.uid = this.hash_
}

Long story short: I upserted a large PDF (10 MB) and everything went well. Then I tried to add one large CSV (8 MB) and got a duplicate key error. I cleaned up and tried again with the first PDF plus another small PDF (200 KB), and got the duplicate key error again.

So, after inspecting the code, I was wondering why we don't simply use the chunk id as the document id. It should be much safer, since chunk ids are auto-generated and unique.
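The difference between the two id strategies can be sketched as follows. This is a minimal illustration, not the project's actual implementation: `hashToId`, `hashedId`, and the `Chunk` type are hypothetical stand-ins for the helpers in the snippet above, SHA-1 is an assumed hash function, and `JSON.stringify` is a simplification of the metadata hashing. It shows one possible explanation for the duplicate keys, which the report above does not confirm: two chunks with identical content and identical metadata always get the same derived id, while randomly generated ids do not depend on content at all.

```typescript
import { createHash, randomUUID } from "crypto";

// Hypothetical stand-in for the hashing helpers in the snippet above:
// derive a deterministic, UUID-shaped id from a string (SHA-1 is an assumption).
function hashToId(input: string): string {
  const hex = createHash("sha1").update(input).digest("hex").slice(0, 32);
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-");
}

// Hypothetical chunk shape, reduced to the two fields the snippet hashes.
interface Chunk {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Same structure as the snippet: hash the content, hash the metadata,
// then hash the concatenation of both hashes.
// JSON.stringify is a simplification; it is sensitive to key order.
function hashedId(chunk: Chunk): string {
  const contentHash = hashToId(chunk.pageContent);
  const metadataHash = hashToId(JSON.stringify(chunk.metadata));
  return hashToId(contentHash + metadataHash);
}

// Two distinct chunks that happen to carry the same text and metadata,
// for example a repeated header row or a boilerplate page.
const a: Chunk = { pageContent: "Page intentionally left blank", metadata: { source: "report.pdf" } };
const b: Chunk = { pageContent: "Page intentionally left blank", metadata: { source: "report.pdf" } };

// Content-derived ids are identical, so inserting both rows would violate a unique key.
const sameHashedId = hashedId(a) === hashedId(b);

// Randomly generated ids are independent of content.
const sameRandomId = randomUUID() === randomUUID();

console.log({ sameHashedId, sameRandomId });
```

The trade-off is that content-derived ids make it possible to detect unchanged chunks on a later upsert, whereas random ids are collision-safe but carry no information about the content.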

In any case, generating a hash from other hashes is bad practice, because it raises the risk of collision.

I can do a PR with backward compatibility if needed.
