Adding this operator to `UNSUPPORTED_THUNDER_FUNCTION = ()` gives a 10x decode-throughput improvement on openai/gpt-oss-20b and Qwen/Qwen3-32B in SGLang's bench_one_batch.py.
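For context, a minimal sketch of how an opt-out set like this typically works: operators listed in the tuple are skipped by the graph runner and fall back to eager execution. The dispatcher and helper names below are hypothetical illustrations, not Thunder's actual API.

```python
# Hypothetical opt-out mechanism: ops whose names appear in
# UNSUPPORTED_THUNDER_FUNCTION bypass graph capture and run eagerly.
# All names here are illustrative, not Thunder's real internals.

# Operators the graph runner should skip (empty by default).
UNSUPPORTED_THUNDER_FUNCTION = ("aten::nonzero",)

def compile_and_run(name, fn, *args):
    """Stand-in for real graph capture/compilation of the op."""
    return fn(*args)

def run_op(name, fn, *args):
    """Dispatch an op: run eagerly if opted out, otherwise compile it."""
    if name in UNSUPPORTED_THUNDER_FUNCTION:
        return fn(*args)  # eager fallback, no graph capture
    return compile_and_run(name, fn, *args)

# An op not in the set goes through the (stubbed) compile path.
print(run_op("aten::add", lambda a, b: a + b, 2, 3))
```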
Repro (this only works on the NV-internal thunder-sglang-integration codebase):

```shell
SGLANG_USE_THUNDER_GRAPH_RUNNER=1 python3 -m sglang.bench_one_batch \
  --model-path openai/gpt-oss-20b \
  --trust-remote-code \
  --model-impl transformers \
  --dtype bfloat16 \
  --json-model-override-args '{"quantization_config": null}' \
  --cuda-graph-bs 1 \
  --tp-size 4 \
  --tp-strategy dtensor \
  --load-format dummy
```