[Perf] improve the hash kernel for mm #8054

The current `gpu_tensor_hash`, implemented in #5974, has the following drawbacks:
1. `add` is a weak reduction for combining hash values (it is commutative, so the result does not depend on the order of the values being combined)
2. The reduction is performed on the CPU, which is slow for large tensors
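
To illustrate the first point in general terms, here is a minimal pure-Python sketch, not the actual `gpu_tensor_hash` code. It assumes per-chunk hash values are combined with a wrapping 64-bit add, and compares that against a hypothetical variant that mixes each value with its position before adding. The names `mix64`, `reduce_add` and `reduce_positional` are made up for this example, and whether the commutativity issue affects the real kernel depends on how its per-element values are computed.

```python
MASK64 = (1 << 64) - 1
GOLDEN = 0x9E3779B97F4A7C15  # odd 64-bit constant, used here only to space out positions


def mix64(x: int) -> int:
    """splitmix64 finalizer: a bijective 64-bit bit mixer."""
    x &= MASK64
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK64
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK64
    x ^= x >> 31
    return x


def reduce_add(partials):
    """Plain wrapping add: commutative, so the order of partials is lost."""
    total = 0
    for p in partials:
        total = (total + p) & MASK64
    return total


def reduce_positional(partials):
    """Hypothetical alternative: fold the index into each partial, then add.

    The final step is still an associative add, so it could in principle be
    done as a parallel on-device reduction, but each term now depends on
    where it sits in the sequence.
    """
    total = 0
    for i, p in enumerate(partials):
        total = (total + mix64((p + (i + 1) * GOLDEN) & MASK64)) & MASK64
    return total


partials = [mix64(v) for v in (1, 2, 3, 4)]
shuffled = list(reversed(partials))

same_under_add = reduce_add(partials) == reduce_add(shuffled)
same_under_positional = reduce_positional(partials) == reduce_positional(shuffled)
```

With plain add, the reversed sequence produces the same combined value; with the position-keyed variant the two sequences are expected to differ.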

## TODO

1. Rewrite the tensor hash function so that it is performant and robust
2. Test the performance, consistency and correctness of the hash function against real data
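
As a starting point for item 2, a small harness along the following lines could be used. This is only a sketch under assumptions: inputs are represented as `bytes` (for real data, the raw bytes of actual multimodal tensors would be used), `evaluate_hash` is a hypothetical helper, and `reference_hash` is a BLAKE2b-based stand-in rather than the kernel to be written.

```python
import hashlib
import time


def reference_hash(data: bytes) -> int:
    """Stand-in 64-bit hash built on BLAKE2b. Not the sglang kernel."""
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little")


def evaluate_hash(hash_fn, samples):
    """Report timing, consistency, collisions and bit-flip sensitivity."""
    samples = list(samples)

    # Performance: wall-clock time for one pass over all samples.
    start = time.perf_counter()
    digests = [hash_fn(s) for s in samples]
    elapsed = time.perf_counter() - start

    # Consistency: hashing the same input again gives the same value.
    consistent = all(hash_fn(s) == d for s, d in zip(samples, digests))

    # Correctness (collisions): distinct inputs that share a digest.
    by_digest = {}
    for s, d in zip(samples, digests):
        by_digest.setdefault(d, set()).add(s)
    collisions = sum(len(group) - 1 for group in by_digest.values())

    # Correctness (sensitivity): flipping one bit should change the digest.
    sensitive = all(
        hash_fn(bytes([s[0] ^ 1]) + s[1:]) != d
        for s, d in zip(samples, digests)
        if s
    )

    return {
        "seconds": elapsed,
        "consistent": consistent,
        "collisions": collisions,
        "sensitive": sensitive,
    }


# Synthetic placeholder inputs; real tensors would replace these.
samples = [bytes([i]) * 16 for i in range(8)]
report = evaluate_hash(reference_hash, samples)

# A byte-sum "hash" shows what an order-insensitive candidate looks like.
weak_report = evaluate_hash(lambda b: sum(b), [b"ab", b"ba"])
```

The same harness can be pointed at any candidate by passing a different `hash_fn`, which makes it easy to compare a new kernel against the current one on identical inputs.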


## Reference

See [this comment](https://github.com/sgl-project/sglang/pull/5974#issuecomment-3017284280) for inspiration.

