[v2][adjuster] Enhance Span Hash Adjuster For Spans That Have Already Been Hashed #6393
Labels
good first issue
Good for beginners
help wanted
Features that maintainers are willing to accept but do not have cycles to implement
Originally posted by @yurishkuro in #6391 (comment)
Some storage backends (Cassandra, in particular) perform similar deduping by computing a hash before the span is saved and using it as part of the partition key (saving an identical span two or more times creates tombstones, but reads return no duplicates). So we could make this hashing a part of the ingestion pipeline (e.g. in sanitizers) and simply store the hash as an attribute on the span. Then this adjuster would be "lazy": it would only recompute the hash if the span read from storage doesn't already carry one.
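A minimal sketch of the "lazy" behaviour described above, assuming a simplified span type rather than the real data model. The `Span` struct, the `span.hash` attribute key, and the `ensureHash` / `dedupe` helpers are all hypothetical names chosen for illustration, and FNV-1a is just a stand-in hash function.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// hashAttrKey is a hypothetical attribute name for the stored span hash.
const hashAttrKey = "span.hash"

// Span is a simplified stand-in for the real span type (assumption).
type Span struct {
	TraceID    string
	SpanID     string
	Name       string
	Attributes map[string]string
}

// computeHash hashes the identity fields and the attributes in sorted key
// order, skipping the hash attribute itself.
func computeHash(s *Span) uint64 {
	h := fnv.New64a()
	write := func(v string) {
		h.Write([]byte(v))
		h.Write([]byte{0}) // separator between values
	}
	write(s.TraceID)
	write(s.SpanID)
	write(s.Name)
	keys := make([]string, 0, len(s.Attributes))
	for k := range s.Attributes {
		if k != hashAttrKey {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	for _, k := range keys {
		write(k)
		write(s.Attributes[k])
	}
	return h.Sum64()
}

// ensureHash reuses a hash already stored on the span and computes one only
// when it is missing or unparsable. The bool reports whether it was computed.
func ensureHash(s *Span) (uint64, bool) {
	if v, ok := s.Attributes[hashAttrKey]; ok {
		if parsed, err := strconv.ParseUint(v, 16, 64); err == nil {
			return parsed, false
		}
	}
	sum := computeHash(s)
	if s.Attributes == nil {
		s.Attributes = map[string]string{}
	}
	s.Attributes[hashAttrKey] = strconv.FormatUint(sum, 16)
	return sum, true
}

// dedupe keeps the first span seen for each hash value.
func dedupe(spans []*Span) []*Span {
	seen := make(map[uint64]struct{}, len(spans))
	out := make([]*Span, 0, len(spans))
	for _, s := range spans {
		h, _ := ensureHash(s)
		if _, dup := seen[h]; dup {
			continue
		}
		seen[h] = struct{}{}
		out = append(out, s)
	}
	return out
}

func main() {
	a := &Span{TraceID: "t1", SpanID: "s1", Name: "GET /", Attributes: map[string]string{"http.method": "GET"}}
	b := &Span{TraceID: "t1", SpanID: "s1", Name: "GET /", Attributes: map[string]string{"http.method": "GET"}}
	ensureHash(b) // simulates the hash being attached on the write path
	fmt.Println("spans after dedupe:", len(dedupe([]*Span{a, b})))
}
```

In this sketch the write path would call `ensureHash` once (for example from a sanitizer), and the read-path adjuster only pays the hashing cost for spans that were stored without the attribute.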
If we do this on the write path, we would want this to be as efficient as possible, so we would need to implement manual hashing by iterating through the attributes (pre-sorting them to avoid order dependencies) and by manually going through all fields of the Span / SpanEvent / SpanLink. The reason I was reluctant to do that in the past was to avoid unintended bugs if the data model was changed, like a new field being added that we'd forget to add to the hash function. To protect against that we could probably use some fuzzing tests, setting / unsetting each field individually and making sure the hash code changes as a result.
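The manual hashing and the field-by-field check could be sketched roughly as follows. This is only an illustration under assumptions: `Span`, `Event`, `Link`, `hashSpan`, and `unhashedFields` are hypothetical stand-ins, not the real data model or an existing API. The check uses reflection to enumerate the struct's fields, so a field added later but not included in the hash would be reported.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"reflect"
	"sort"
)

// Simplified stand-ins for Span / SpanEvent / SpanLink (assumptions).
type Event struct {
	Name         string
	TimeUnixNano uint64
}

type Link struct {
	TraceID string
	SpanID  string
}

type Span struct {
	TraceID       string
	SpanID        string
	Name          string
	StartUnixNano uint64
	Attributes    map[string]string
	Events        []Event
	Links         []Link
}

// hashSpan walks every field by hand. Strings are length-prefixed and
// attribute keys are sorted so the result does not depend on map order.
func hashSpan(s Span) uint64 {
	h := fnv.New64a()
	num := func(v uint64) {
		var b [8]byte
		binary.LittleEndian.PutUint64(b[:], v)
		h.Write(b[:])
	}
	str := func(v string) {
		num(uint64(len(v)))
		h.Write([]byte(v))
	}
	str(s.TraceID)
	str(s.SpanID)
	str(s.Name)
	num(s.StartUnixNano)
	keys := make([]string, 0, len(s.Attributes))
	for k := range s.Attributes {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	num(uint64(len(keys)))
	for _, k := range keys {
		str(k)
		str(s.Attributes[k])
	}
	num(uint64(len(s.Events)))
	for _, e := range s.Events {
		str(e.Name)
		num(e.TimeUnixNano)
	}
	num(uint64(len(s.Links)))
	for _, l := range s.Links {
		str(l.TraceID)
		str(l.SpanID)
	}
	return h.Sum64()
}

// unhashedFields zeroes each top-level field of a fully populated span in
// turn and returns the names of fields whose removal leaves the hash unchanged.
func unhashedFields(full Span, hash func(Span) uint64) []string {
	base := hash(full)
	var missed []string
	t := reflect.TypeOf(full)
	for i := 0; i < t.NumField(); i++ {
		mutated := full // shallow copy; only the copy's field is reset
		f := reflect.ValueOf(&mutated).Elem().Field(i)
		f.Set(reflect.Zero(f.Type()))
		if hash(mutated) == base {
			missed = append(missed, t.Field(i).Name)
		}
	}
	return missed
}

func main() {
	full := Span{
		TraceID: "t1", SpanID: "s1", Name: "GET /", StartUnixNano: 42,
		Attributes: map[string]string{"http.method": "GET"},
		Events:     []Event{{Name: "retry", TimeUnixNano: 43}},
		Links:      []Link{{TraceID: "t0", SpanID: "s0"}},
	}
	fmt.Println("fields not covered by the hash:", len(unhashedFields(full, hashSpan)))
}
```

The check only toggles top-level fields of `Span`; covering the fields nested inside events and links would need the same treatment applied to those types. It also relies on the input span having every field set to a non-zero value.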