Hi @HenryHengZJ,

I'm currently running into trouble with the indexing feature when managing large files (8–10 MB CSV & PDF): some IDs conflict (duplicate key in Postgres), most likely (not certain, but it seems obvious) because of the hashed IDs:
```typescript
const contentHash = this._hashStringToUUID(this.pageContent)
try {
    const metadataHash = this._hashNestedDictToUUID(this.metadata)
    this.contentHash = contentHash
    this.metadataHash = metadataHash
} catch (e) {
    throw new Error(`Failed to hash metadata: ${e}. Please use a dict that can be serialized using json.`)
}
this.hash_ = this._hashStringToUUID(this.contentHash + this.metadataHash)
if (!this.uid) {
    this.uid = this.hash_
}
```
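To illustrate the failure mode (a minimal sketch, not Flowise's actual hashing code, and `hashStringToUUID` here is a hypothetical stand-in): because the id is fully determined by the chunk's content and metadata, two chunks that happen to carry the same text and metadata, which is easy to hit in a large CSV with repeated rows, end up with the same key.

```typescript
import { createHash } from 'crypto'

// Hypothetical stand-in for _hashStringToUUID: derive a UUID-shaped id
// from the SHA-256 digest of the input string.
function hashStringToUUID(input: string): string {
    const hex = createHash('sha256').update(input, 'utf8').digest('hex')
    return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20, 32)}`
}

// Two distinct chunks that happen to carry identical text and metadata,
// e.g. repeated rows of a large CSV split into separate documents.
const metadata = JSON.stringify({ source: 'data.csv' })
const idA = hashStringToUUID('ACME;2023;42' + metadata)
const idB = hashStringToUUID('ACME;2023;42' + metadata)

console.log(idA === idB) // true -> the second insert hits "duplicate key"
```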
Long story short: I upserted a large PDF (10 MB) and everything went well. Then I tried to add one large CSV (8 MB) and got a duplicate-key error. So I cleaned everything up and tried again with the first PDF plus another small PDF (200 KB): duplicate key again.
So, after inspecting the code, I was wondering: why don't we simply use the chunk ID as the document ID? That should be much safer, since chunk IDs are auto-generated and unique.
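As a rough sketch of what I mean (hypothetical code, not a patch against the actual Flowise classes; `ChunkRecord` and `toChunkRecord` are illustrative names), the stored id would fall back to an auto-generated UUID rather than a hash of content + metadata:

```typescript
import { randomUUID } from 'crypto'

interface ChunkRecord {
    uid: string
    pageContent: string
    metadata: Record<string, unknown>
}

// Hypothetical helper: build the stored record around an auto-generated id
// instead of a hash derived from the chunk's content and metadata.
function toChunkRecord(pageContent: string, metadata: Record<string, unknown>): ChunkRecord {
    return {
        uid: randomUUID(), // unique per call, so identical chunks can never share a key
        pageContent,
        metadata
    }
}
```

Randomly generated IDs are unique by construction, so the duplicate-key error cannot occur no matter how similar the chunks are.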
In any case, it's bad practice to derive a hash from other hashes, because it significantly raises the risk of collisions.
I can do a PR with backward compatibility if needed.