@Chandan-Sugreevu
Introduces a high-performance custom TVM schedule for the combined QKV-split and Rotary Positional Embedding (RoPE) operation.

The schedule runs the entire computation in a single fused CUDA kernel, which reduces kernel launch overhead and improves memory access patterns for Llama-style models on GPU.
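
For reviewers who want a reference for what the fused operation computes, here is a minimal pure-Python sketch of the QKV split followed by RoPE for a single token. It is not the TVM schedule from this PR. The function name `split_qkv_rope`, its parameters, the flat `[Q heads | K heads | V heads]` layout, the "rotate-half" pairing of element `i` with `i + head_dim/2`, and the base `theta=10000.0` are all assumptions chosen for illustration. The single loop over heads mirrors the idea of doing the split and the rotation in one pass instead of separate kernels.

```python
import math


def split_qkv_rope(qkv, num_q_heads, num_kv_heads, head_dim, position, theta=10000.0):
    """Split one token's fused QKV vector and apply RoPE to Q and K.

    Hypothetical reference helper. Assumes `qkv` is a flat list laid out as
    [Q heads | K heads | V heads], each head holding `head_dim` values, and
    assumes the rotate-half convention (element i pairs with i + head_dim/2).
    """
    half = head_dim // 2
    total_heads = num_q_heads + 2 * num_kv_heads
    assert head_dim % 2 == 0, "RoPE needs an even head dimension"
    assert len(qkv) == total_heads * head_dim

    q = [[0.0] * head_dim for _ in range(num_q_heads)]
    k = [[0.0] * head_dim for _ in range(num_kv_heads)]
    v = [[0.0] * head_dim for _ in range(num_kv_heads)]

    # One pass over every head: split and rotate together, with no
    # intermediate Q/K/V buffers between a "split" step and a "RoPE" step.
    for h in range(total_heads):
        row = qkv[h * head_dim:(h + 1) * head_dim]

        if h >= num_q_heads + num_kv_heads:
            # V heads are copied through without rotation.
            v[h - num_q_heads - num_kv_heads] = list(row)
            continue

        out = q[h] if h < num_q_heads else k[h - num_q_heads]
        for i in range(half):
            # Rotation angle for frequency index i at this position.
            angle = position * theta ** (-2.0 * i / head_dim)
            c, s = math.cos(angle), math.sin(angle)
            x1, x2 = row[i], row[i + half]
            out[i] = x1 * c - x2 * s
            out[i + half] = x2 * c + x1 * s

    return q, k, v
```

A reference like this can serve as a correctness check for a fused kernel: at `position=0` every rotation is the identity, and at any position the V heads come back unchanged while each Q and K head keeps its Euclidean norm.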
