@JackAKirk JackAKirk commented May 13, 2025

Makes short kernels launch faster when they don't need to see the same global memory (or when the user guarantees that global memory writes are complete). See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization

Makes lots of short kernels in cutlass great again. cc @FMarno who identified this performance gap.
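
For readers unfamiliar with the CUDA feature described in the linked guide, here is a minimal sketch of programmatic dependent launch in plain CUDA. It is not code from this PR: the kernel names `producer` and `consumer` and the helper `launch_pair` are hypothetical, and the sketch assumes a device of compute capability 9.0 or newer (behavior on older devices is not covered here).

```cuda
#include <cuda_runtime.h>

// Hypothetical primary kernel: writes the buffer, then signals that
// dependent kernels may begin launching.
__global__ void producer(float *buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = 1.0f;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  cudaTriggerProgrammaticLaunchCompletion();
#endif
}

// Hypothetical secondary kernel: may start early, but must wait before
// touching memory written by the primary kernel.
__global__ void consumer(float *buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // Work that does not depend on the producer could run here.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // Blocks until the primary grid has completed and its writes are visible.
  cudaGridDependencySynchronize();
#endif
  if (i < n) buf[i] += 1.0f;
}

// Launches both kernels on one stream, allowing the second launch to
// overlap with the tail of the first.
cudaError_t launch_pair(float *buf, int n, cudaStream_t stream) {
  const unsigned block = 128;
  const unsigned grid = (n + block - 1) / block;
  producer<<<grid, block, 0, stream>>>(buf, n);

  cudaLaunchAttribute attr[1];
  attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr[0].val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t cfg = {};
  cfg.gridDim = dim3(grid);
  cfg.blockDim = dim3(block);
  cfg.dynamicSmemBytes = 0;
  cfg.stream = stream;
  cfg.attrs = attr;
  cfg.numAttrs = 1;
  return cudaLaunchKernelEx(&cfg, consumer, buf, n);
}
```

If the consumer never reads what the producer wrote, the `cudaGridDependencySynchronize()` call can be dropped, which is the "don't need to see the same global memory" case mentioned above.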

@JackAKirk JackAKirk requested review from a team as code owners May 13, 2025 13:01
@JackAKirk JackAKirk requested a review from jchlanda May 13, 2025 13:01
Signed-off-by: JackAKirk <[email protected]>
@kbenzie
Contributor

kbenzie commented May 13, 2025

Reasonable chance this will interact/conflict with #18385

@aarongreig
Contributor

Yeah, I think I'm going to need to rethink how devices report support for different properties.

Contributor

@jchlanda jchlanda left a comment

Do we need to test this feature?

/// [in] non-zero value indicates the amount of work group memory to
/// allocate in bytes
size_t workgroup_mem_size;
/// [in] non-zero value indicates a opportunistic native queue serialized
Contributor

Suggested change
- /// [in] non-zero value indicates a opportunistic native queue serialized
+ /// [in] non-zero value indicates an opportunistic native queue serialized

Contributor Author

Thanks, I updated the script to generate this change.

@JackAKirk
Contributor Author

@intel/llvm-gatekeepers

This is ready to merge. Thanks

@sarnex
Contributor

sarnex commented May 21, 2025

Do we need to wait for CI to pass?

@sarnex
Contributor

sarnex commented May 21, 2025

Seems CI is failing, ping us when it's ready for merge

Signed-off-by: JackAKirk <[email protected]>
@JackAKirk
Contributor Author

@intel/llvm-gatekeepers this is ready to merge. Graph functionality is independent of this change; the arc graph failure is described in issue #18668.
Thanks

@dm-vodopyanov dm-vodopyanov merged commit bda408a into intel:sycl May 30, 2025
31 of 32 checks passed
6 participants