RFC: Repurposing TMA for O(N) Ultrametric Indexing and TAL-logic integration #2907

StanByriukov02 · 2025-12-25T14:33:40Z


I have been exploring the capabilities of the Tensor Memory Accelerator (TMA) on the Hopper and Blackwell architectures. TMA is normally used for asynchronous data copies, but I found a way to use it for topological indexing of LCP (Longest Common Prefix) trees in O(N) time.
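
For readers unfamiliar with the link between LCP trees and ultrametrics, here is a minimal host-side C++ sketch. It is not the TMA kernel from the attached pack. It assumes one common definition, distance = 2^-LCP, and the names `lcp_length` and `lcp_distance` are hypothetical. Under that assumption, the distance obeys the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)), which is the property that makes the space ultrametric.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string_view>

// Length of the longest common prefix of two byte sequences.
std::size_t lcp_length(std::string_view a, std::string_view b) {
    const std::size_t n = std::min(a.size(), b.size());
    std::size_t i = 0;
    while (i < n && a[i] == b[i]) {
        ++i;
    }
    return i;
}

// Assumed (illustrative) ultrametric: 0 for identical keys, otherwise 2^-LCP.
// A longer shared prefix means a smaller distance.
double lcp_distance(std::string_view a, std::string_view b) {
    if (a == b) {
        return 0.0;
    }
    return std::ldexp(1.0, -static_cast<int>(lcp_length(a, b)));
}
```

For example, "tensor" and "tenant" share the prefix "ten", so under this assumed definition their distance is 2^-3.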

This enables an implementation of TAL (Thermal-Aware Logic), an approach that minimizes bit-state flips (entropy injection) when searching very large contexts. My measurements on an H100 show a 2.5x reduction in power consumption compared with standard Tensor Cores (TC) when working with 500k+ contexts.
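
To make "bit-state flips" concrete, here is a small C++ sketch that counts how many bits change between successive 64-bit states. This is only an assumed software proxy for illustration. It is not how the power figures above were measured, and `bit_flips` and `total_bit_flips` are hypothetical names.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bit positions that differ between two 64-bit words
// (the Hamming distance, computed as popcount of the XOR).
std::size_t bit_flips(std::uint64_t before, std::uint64_t after) {
    return std::bitset<64>(before ^ after).count();
}

// Total flips over a sequence of successive states.
// Returns 0 for sequences with fewer than two states.
std::size_t total_bit_flips(const std::vector<std::uint64_t>& states) {
    std::size_t total = 0;
    for (std::size_t i = 1; i < states.size(); ++i) {
        total += bit_flips(states[i - 1], states[i]);
    }
    return total;
}
```

A visiting order that keeps consecutive states close in Hamming distance (a Gray-code-like order, for instance) yields a lower total under this proxy.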

I've prepared a Force Packet with NCU (Nsight Compute) logs and JSON benchmarks. I would like to discuss whether this primitive could be integrated into the main CUTLASS release to accelerate long-context retrieval.
CTDR_public_pack_20251219.zip

Replies: 0 comments
