[Discussion] Purpose of hashed id for indexing #3868

JJK801 · 2025-01-14T16:28:51Z

I'm having trouble with the indexing feature when managing large files (8-10 MB CSV and PDF): some ids conflict (duplicate key in Postgres), probably (I'm not sure, but it seems likely) because of the hashed ids:

const contentHash = this._hashStringToUUID(this.pageContent)

try {
    const metadataHash = this._hashNestedDictToUUID(this.metadata)
    this.contentHash = contentHash
    this.metadataHash = metadataHash
} catch (e) {
    throw new Error(`Failed to hash metadata: ${e}. Please use a dict that can be serialized using json.`)
}

this.hash_ = this._hashStringToUUID(this.contentHash + this.metadataHash)

if (!this.uid) {
    this.uid = this.hash_
}

Long story short: I upserted a large PDF (10 MB) and everything went well. Then I tried to add one large CSV (8 MB) and got a duplicate key error. So I cleaned up and tried again with the first PDF and another small PDF (200 KB): duplicate key again.

So, looking at the code, I was wondering why we don't simply use the chunk id as the document id. It should be much safer, since chunk ids are auto-generated and unique.
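
To make the difference concrete, here is a minimal sketch of the two id strategies. Everything in it is an assumption for illustration: the `Chunk` shape, `hashedId` and `randomId` are hypothetical names, and SHA-256 is only a stand-in for whatever `_hashStringToUUID` does internally. It shows one possible explanation for the duplicate keys (not confirmed): two chunks with identical content and metadata always get the same hashed id, whereas a generated id is different every time.

```typescript
import { createHash, randomUUID } from "node:crypto";

// Hypothetical chunk shape, for illustration only.
interface Chunk {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Deterministic, content-based id. This is a stand-in, NOT the project's
// _hashStringToUUID: it hashes content + metadata with SHA-256 and formats
// the first 32 hex characters in a UUID-like layout.
function hashedId(chunk: Chunk): string {
  const hex = createHash("sha256")
    .update(chunk.pageContent)
    .update("\u0000") // separator between content and metadata
    .update(JSON.stringify(chunk.metadata))
    .digest("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-");
}

// Generated id: unique per call, independent of the chunk's content.
function randomId(): string {
  return randomUUID();
}

// Two chunks that happen to be identical, e.g. a repeated page footer
// or a repeated CSV row carrying the same metadata.
const a: Chunk = {
  pageContent: "Page intentionally left blank",
  metadata: { source: "report.pdf" },
};
const b: Chunk = {
  pageContent: "Page intentionally left blank",
  metadata: { source: "report.pdf" },
};

// Same content and metadata give the same hashed id,
// which would surface as a duplicate key on insert.
console.log(hashedId(a) === hashedId(b));

// Generated ids never depend on content, so they differ.
console.log(randomId() === randomId());
```

The trade-off is that a generated id can no longer be used to detect that a chunk was already indexed, which is presumably what the hashed id is for.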

Anyway, it's bad practice to generate a hash from other hashes, because it drastically raises the risk of collision.

I can do a PR with backward compatibility if needed.
