@intelgaoxiong
Contributor

@intelgaoxiong intelgaoxiong commented Jan 28, 2026

Details:

Background:
#33372 implemented HOST_ROUTED processing for MoE decoding.
However, the submission overhead of that mode limits decoding throughput.

Optimization:
This PR improves MoE decoding throughput (TPS) with DEVICE_ROUTED processing:

  • Expert selection is performed dynamically on the device using Gather operations, which avoids graph splitting and reduces host-device overhead.
  • Inference execution is the same as for a traditional (non-MoE) LLM.

TPS improves from 12 t/s to 17.9 t/s.
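
To illustrate the idea behind the first bullet, here is a minimal pure-Python sketch of Gather-based expert selection. It is not the NPUW implementation: the function names (`matvec`, `gather`, `route`, `device_routed_moe`), the softmax gating over the top-k router logits, and the toy square expert matrices are all assumptions made for illustration. The point is only that once the top-k expert indices are known, a single Gather over the stacked expert weights selects the active experts inside one graph, with no per-expert dispatch from the host.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def gather(table, indices):
    """Gather along axis 0: pick table[i] for every i in indices."""
    return [table[i] for i in indices]


def route(x, router_w, top_k):
    """Return the top-k expert indices and their softmax gate weights (assumed gating)."""
    logits = matvec(router_w, x)
    top_idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in top_idx)
    exps = [math.exp(logits[i] - peak) for i in top_idx]
    total = sum(exps)
    return top_idx, [e / total for e in exps]


def device_routed_moe(x, router_w, expert_w, top_k):
    """Select experts with one gather over stacked weights, then mix their outputs."""
    top_idx, gates = route(x, router_w, top_k)
    selected = gather(expert_w, top_idx)      # shape: (top_k, out, hidden)
    outs = [matvec(w, x) for w in selected]   # shape: (top_k, out)
    return [sum(g * o[j] for g, o in zip(gates, outs)) for j in range(len(outs[0]))]


# Toy data (hypothetical): 3 experts, hidden size 2, top-2 routing.
x = [1.0, 2.0]
router_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
expert_w = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scale by 2
    [[0.0, 1.0], [1.0, 0.0]],  # expert 2: swap components
]
print(device_routed_moe(x, router_w, expert_w, top_k=2))
```

In a HOST_ROUTED-style flow, the equivalent of `route` would run on the host and each selected expert would be submitted separately; here the selection and the expert computation stay in one function, which mirrors the "no graph splitting" claim above.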

NPUW config:

{
	"NPUW_DEVICES" : "NPU",
	"MAX_PROMPT_LEN" : 1024,
	"NPUW_MOE_TOKEN_CHUNK_SIZE" : 0,
	"NPUW_LLM_GENERATE_MOE_HINT" : "DEVICE_ROUTED",
	"NPUW_F16IC" : "YES",
	"NPUW_LLM_OPTIMIZE_V_TENSORS" : "YES",
	"NPU_TURBO" : "YES",
	"NPUW_DUMP_SUBS" : "YES",
	"NPUW_DUMP_IO" : "NO",
	"NPU_COMPILER_TYPE" : "DRIVER"
}
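
The PR does not say how this config is passed to the runtime. As a hedged sketch, the snippet below parses the same JSON with the Python standard library and checks that DEVICE_ROUTED is selected before use. The helper name `load_npuw_config` is hypothetical, and the commented-out pipeline call at the end is an assumption about one possible consumer (OpenVINO GenAI), not something stated in this PR.

```python
import json

# The NPUW config from this PR description, reproduced verbatim as JSON text.
NPUW_CONFIG_JSON = """
{
    "NPUW_DEVICES" : "NPU",
    "MAX_PROMPT_LEN" : 1024,
    "NPUW_MOE_TOKEN_CHUNK_SIZE" : 0,
    "NPUW_LLM_GENERATE_MOE_HINT" : "DEVICE_ROUTED",
    "NPUW_F16IC" : "YES",
    "NPUW_LLM_OPTIMIZE_V_TENSORS" : "YES",
    "NPU_TURBO" : "YES",
    "NPUW_DUMP_SUBS" : "YES",
    "NPUW_DUMP_IO" : "NO",
    "NPU_COMPILER_TYPE" : "DRIVER"
}
"""


def load_npuw_config(text):
    """Parse the config and fail early if DEVICE_ROUTED is not selected (hypothetical helper)."""
    config = json.loads(text)
    hint = config.get("NPUW_LLM_GENERATE_MOE_HINT")
    if hint != "DEVICE_ROUTED":
        raise ValueError(f"expected DEVICE_ROUTED, got {hint!r}")
    return config


config = load_npuw_config(NPUW_CONFIG_JSON)
print(config["NPUW_LLM_GENERATE_MOE_HINT"])

# Assumption: with OpenVINO GenAI installed, the dict could be forwarded as
# plugin properties; "model_dir" is a placeholder path, not from this PR.
# import openvino_genai
# pipe = openvino_genai.LLMPipeline("model_dir", "NPU", **config)
```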

Tickets:

@github-actions github-actions bot added the category: build, category: samples, category: NPU, and category: NPUW labels and removed the category: samples label Jan 28, 2026
@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch 2 times, most recently from d8f7978 to 10e6b84 Compare January 30, 2026 05:43
@intelgaoxiong intelgaoxiong changed the title from "[NPUW]DEVICE_ROUTED mode for MoE (GPT-OSS-20B) decoding on NPU." to "[NPUW]Optimize MoE (GPT-OSS-20B) TPS on NPU - DEVICE_ROUTED." Jan 30, 2026
@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch from 10e6b84 to 97b9ea3 Compare January 31, 2026 02:55
@github-actions github-actions bot removed the category: build OpenVINO cmake script / infra label Jan 31, 2026
@intelgaoxiong intelgaoxiong marked this pull request as ready for review January 31, 2026 03:06
@intelgaoxiong intelgaoxiong requested review from a team as code owners January 31, 2026 03:06
@dmatveev dmatveev added this to the 2026.1 milestone Feb 1, 2026
@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch 3 times, most recently from 3dacd25 to 270410d Compare February 3, 2026 05:20
@github-actions github-actions bot added the category: build OpenVINO cmake script / infra label Feb 3, 2026
@intelgaoxiong
Contributor Author

This PR includes the changes from #33924 and should be merged after #33924.

Convert gather to 2D.

Gather before convert.

Keep gather indices as constant.

Use JustInferRequest for DEVICE_ROUTED mode.

Clean up transformations for DEVICE_ROUTED.

Update config for DEVICE_ROUTED: BEST_PERF and do not cut the LM head.

Refactor device routed transformation.

Refactor GatherTo2DGather.

Apply MoE defaults if not explicitly set in external config.

Collect MoE nodes in single loop.

Signed-off-by: intelgaoxiong <[email protected]>
@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch from 270410d to 7099966 Compare February 4, 2026 01:54

Labels

category: build, category: NPU, category: NPUW
