RFC: Repurposing TMA for O(N) Ultrametric Indexing and TAL-logic integration #2907
StanByriukov02 started this conversation in Ideas
I have been exploring the capabilities of the Tensor Memory Accelerator (TMA) on the Hopper and Blackwell architectures. TMA is normally used for asynchronous data copies, but I found a way to repurpose it for topological indexing of LCP (Longest Common Prefix) trees in O(N) time.
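To make the "ultrametric" part of the title concrete, here is a minimal CPU-side sketch of the underlying idea only; it does not use TMA and is not the implementation from the attached pack. It assumes the distance between two token sequences is defined as `2 ** -lcp_len`, which is one common way to turn LCP length into an ultrametric. The names `lcp_len`, `lcp_distance`, and `is_ultrametric` are hypothetical and chosen for illustration.

```python
from itertools import permutations


def lcp_len(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lcp_distance(a, b):
    """Assumed distance: 0 for identical sequences, else 2^-LCP."""
    if a == b:
        return 0.0
    return 2.0 ** -lcp_len(a, b)


def is_ultrametric(seqs):
    """Check the strong triangle inequality d(x,z) <= max(d(x,y), d(y,z))."""
    return all(
        lcp_distance(x, z) <= max(lcp_distance(x, y), lcp_distance(y, z))
        for x, y, z in permutations(seqs, 3)
    )


# Toy token sequences (illustrative data, not from the benchmarks).
seqs = [(1, 2, 3, 4), (1, 2, 3, 9), (1, 2, 7, 7), (5, 5, 5, 5)]

# Sorting places sequences with long shared prefixes next to each other,
# so a single linear pass over neighbours yields the adjacent-LCP array.
ordered = sorted(seqs)
adjacent_lcp = [lcp_len(a, b) for a, b in zip(ordered, ordered[1:])]
```

Because the strong triangle inequality holds, sequences sharing a prefix form nested clusters, which is what makes a prefix tree a natural index for this distance.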
This makes it possible to implement TAL (Thermal-Aware Logic), an approach that minimizes bit state flips (entropy injection) when searching very large contexts. My measurements on an H100 show a 2.5x reduction in power consumption compared with the standard Tensor Core path at context lengths of 500k+.
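As a rough illustration of what "minimizing bit state flips" means, the sketch below counts toggles between successive integer states using Hamming distance. This is a generic software proxy for switching activity, not the TAL method and not a power measurement; the helper names `bit_flips`, `total_flips`, and `gray` are hypothetical.

```python
def bit_flips(prev, curr):
    """Number of bit positions that differ between two integer states."""
    return bin(prev ^ curr).count("1")


def total_flips(states):
    """Sum of bit flips across a sequence of successive states."""
    return sum(bit_flips(a, b) for a, b in zip(states, states[1:]))


def gray(i):
    """Standard reflected binary Gray code of i."""
    return i ^ (i >> 1)


# Visiting the same eight 3-bit states in two different orders.
binary_order = list(range(8))
gray_order = [gray(i) for i in range(8)]

flips_binary = total_flips(binary_order)
flips_gray = total_flips(gray_order)
```

The same set of states is visited in both cases; only the traversal order changes the number of toggles, which is the kind of effect a flip-aware search order would aim to exploit.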
I've prepared a Force Packet with NCU logs and JSON benchmarks (attached below). I would like to discuss whether this primitive could be integrated into the main CUTLASS release to accelerate Long-Context Retrieval.
CTDR_public_pack_20251219.zip