@yurekami yurekami commented Jan 1, 2026

Summary

This PR adds an environment variable DG_MINIMIZE_NUM_SMS to control whether DeepGEMM should minimize the number of SMs used for kernel execution, addressing #239.

Background

The current SM minimization logic in get_best_config() recomputes the minimal number of SMs required:

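if (ArchSpec::should_minimize_num_sms()) {
    // Output tiles (across all groups) divided by the number of waves gives the SMs needed
    num_min_sms = ceil_div(ceil_div(m, best_block_m) * ceil_div(n, best_block_n) * num_groups, best_num_waves);
    // Round up to a multiple of the multicast group size
    num_min_sms = align(num_min_sms, best_multicast_config.num_multicast);
}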

Because the result depends on m, n, and num_groups, different problem shapes can produce different SM counts. Each distinct configuration can invoke kernel compilation again, causing unstable time-to-first-token (TTFT) during inference.
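To make the shape dependence concrete, here is a minimal sketch of the arithmetic in the snippet above. The `ceil_div` and `align` definitions are assumed stand-ins (round-up division and round-up to a multiple), `min_sms` is a hypothetical wrapper, and the shapes and block sizes are illustrative values, not ones taken from DeepGEMM.

```cpp
// Assumed stand-ins for the helpers used in the snippet above.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }
constexpr int align(int a, int b) { return ceil_div(a, b) * b; }

// Hypothetical wrapper mirroring the two statements inside the `if`.
constexpr int min_sms(int m, int n, int block_m, int block_n,
                      int num_groups, int num_waves, int num_multicast) {
    return align(
        ceil_div(ceil_div(m, block_m) * ceil_div(n, block_n) * num_groups, num_waves),
        num_multicast);
}

// Illustrative shapes only: changing m from 256 to 384 changes the result,
// so the two shapes would end up with different SM counts.
static_assert(min_sms(256, 4096, 128, 128, 1, 1, 2) == 64, "2 * 32 tiles, one wave");
static_assert(min_sms(384, 4096, 128, 128, 1, 1, 2) == 96, "3 * 32 tiles, one wave");
// 5 tiles over 2 waves need 3 SMs, rounded up to a multiple of 2.
static_assert(min_sms(128, 640, 128, 128, 1, 2, 2) == 4, "aligned to multicast size");
```

With minimization disabled, the SM count would presumably no longer vary with the shape in this way, which is the source of the more predictable compilation behavior described in this PR.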

Changes

  • Modified SM90ArchSpec::should_minimize_num_sms() to check the DG_MINIMIZE_NUM_SMS environment variable
  • Modified SM100ArchSpec::should_minimize_num_sms() to check the DG_MINIMIZE_NUM_SMS environment variable
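The check described in the list above could look roughly like the following. This is a sketch under stated assumptions, not the code in this PR: `get_env_int` is a hypothetical helper built on `std::getenv`, and the free function `should_minimize_num_sms` stands in for the per-architecture static methods. Only the variable name and its default of 1 come from the description.

```cpp
#include <cstdlib>

// Hypothetical helper: read an integer from the environment, falling back
// to `default_value` when the variable is unset, empty, or not a number.
inline int get_env_int(const char* name, int default_value) {
    const char* raw = std::getenv(name);
    if (raw == nullptr || *raw == '\0')
        return default_value;
    char* end = nullptr;
    const long value = std::strtol(raw, &end, 10);
    return (end == raw) ? default_value : static_cast<int>(value);
}

// Stand-in for SM90ArchSpec / SM100ArchSpec::should_minimize_num_sms().
// Defaulting to 1 keeps SM minimization enabled when the variable is unset.
inline bool should_minimize_num_sms() {
    return get_env_int("DG_MINIMIZE_NUM_SMS", 1) != 0;
}
```

This sketch reads the environment on every call. A real implementation might cache the value instead, so that changing the variable mid-process has no effect.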

Usage

# Enable SM minimization (default, current behavior)
export DG_MINIMIZE_NUM_SMS=1

# Disable SM minimization for stable TTFT
export DG_MINIMIZE_NUM_SMS=0

Benefits

  • DG_MINIMIZE_NUM_SMS=1 (default): Better L2 cache usage, reduced GPU frequency drops
  • DG_MINIMIZE_NUM_SMS=0: More predictable compilation, stable TTFT

Test Plan

  • Verify default behavior is unchanged (SM minimization enabled)
  • Verify DG_MINIMIZE_NUM_SMS=0 disables SM minimization
  • Compare TTFT stability with env var disabled

Fixes #239

🤖 Generated with Claude Code

This adds an environment variable to control whether DeepGEMM should
minimize the number of SMs used for kernel execution.

Background:
The SM minimization logic recomputes the minimal number of SMs required
for each GEMM configuration, which can cause multiple kernel compilations
and result in unstable time-to-first-token (TTFT) during inference.

Usage:
- DG_MINIMIZE_NUM_SMS=1 (default): Enable SM minimization for better
  L2 cache usage and reduced GPU frequency drops
- DG_MINIMIZE_NUM_SMS=0: Disable SM minimization for more predictable
  compilation behavior and stable TTFT

The change is backward compatible - the default behavior remains unchanged.

Fixes deepseek-ai#239

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Development

Successfully merging this pull request may close these issues.

[Feature Request] Maybe we can add env var to control whether use min sms?
