CUDA op getrows fails for long sequences #11189
Open
milot-mirdita wants to merge 1 commit into ggml-org:master from
Conversation
T5 embeddings have a square input pos tensor which quickly exceeds the 65k grid limit of getrows. Implemented only for _float; the other implementations are still needed.
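For context, a minimal standalone sketch of the failure mode (the kernel, names, and launch shape below are assumptions for illustration, not the actual ggml getrows code): a launch whose grid y dimension is set to the number of rows is rejected once that number exceeds 65535, the documented maximum for gridDim.y.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy gather kernel: one grid row in y per gathered row (illustrative only).
__global__ void k_gather_rows(const float * src, const int * rows, float * dst, int ncols) {
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y; // gridDim.y is capped at 65535 by the hardware
    if (col < ncols) {
        dst[(size_t) row * ncols + col] = src[(size_t) rows[row] * ncols + col];
    }
}

int main() {
    const int ncols = 64;
    const int nrows = 70000; // more rows than gridDim.y allows
    float *src, *dst; int *rows;
    cudaMalloc((void **) &src,  (size_t) nrows * ncols * sizeof(float));
    cudaMalloc((void **) &dst,  (size_t) nrows * ncols * sizeof(float));
    cudaMalloc((void **) &rows, (size_t) nrows * sizeof(int));
    cudaMemset(rows, 0, (size_t) nrows * sizeof(int));

    const dim3 block_dims(64, 1, 1);
    const dim3 block_nums((ncols + 63) / 64, nrows, 1); // grid y = 70000 > 65535
    k_gather_rows<<<block_nums, block_dims>>>(src, rows, dst, ncols);

    // Prints "invalid configuration argument": the launch is rejected outright.
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(src); cudaFree(dst); cudaFree(rows);
    return 0;
}
```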
Comment on lines +122 to +123
+    static const int64_t MAX_GRID_Y = 65535;
+    for (int64_t startY = 0; startY < ne10; startY += MAX_GRID_Y) {
Collaborator
This code is incorrect. The grid y dimension uses 16 bits and ranges from 0 to 65535 (inclusive). So the correct stride would be 65536. With this code two threads per grid write to the same address (though this should result in identical results). The correct way to fix this would be to modify the CUDA kernel and have it iterate with a stride of 65536 over the y dimension. This will also avoid issues with the number of nodes in a CUDA graph varying depending on input parameters.
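A minimal sketch of the kernel-side fix being suggested here, with assumed names and a toy float-only gather rather than the actual ggml kernel: cap the launch's y dimension at the hardware maximum and let each block iterate over further rows with a stride of gridDim.y, so the launch geometry no longer depends on the number of rows.

```cuda
#include <algorithm>
#include <cstdint>
#include <cuda_runtime.h>

// Toy float-only gather kernel with a grid-stride loop over the row dimension.
__global__ void k_gather_rows_strided(const float * src, const int32_t * rows,
                                      float * dst, int64_t ncols, int64_t nrows) {
    const int64_t col = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= ncols) {
        return;
    }
    // Each block handles rows blockIdx.y, blockIdx.y + gridDim.y, ... so the
    // grid's y dimension can stay within the hardware limit for any nrows.
    for (int64_t row = blockIdx.y; row < nrows; row += gridDim.y) {
        dst[row * ncols + col] = src[(int64_t) rows[row] * ncols + col];
    }
}

static void gather_rows_strided(const float * src, const int32_t * rows, float * dst,
                                int64_t ncols, int64_t nrows, cudaStream_t stream) {
    const dim3 block_dims(256, 1, 1);
    const unsigned grid_x = (unsigned) ((ncols + block_dims.x - 1) / block_dims.x);
    const unsigned grid_y = (unsigned) std::min<int64_t>(nrows, 65535); // capped y dimension
    const dim3 block_nums(grid_x, grid_y, 1);
    k_gather_rows_strided<<<block_nums, block_dims, 0, stream>>>(src, rows, dst, ncols, nrows);
}
```

Because this is a single launch whose geometry no longer varies with the number of rows, the number of nodes in a captured CUDA graph also stays constant across inputs, which addresses the second point above.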
I have integrated the ProstT5 protein language model into Foldseek. Thanks a lot for the great library! I am upstreaming a few fixes for issues I found in ggml during the integration. I hope it's okay to push the changes here and that they get synced to the main ggml repo at some point.
The T5 encoder has a square input pos tensor (llm_build_pos_bucket(causal = false)) which quickly exceeds the 65k limit (on most GPUs?) of the CUDA GET_ROWS op. I have implemented this only for the _float op and I don't feel very confident in CUDA programming. I tested this change specifically for my use case against the reference implementation of my model, but I don't have models ready to test the quantized versions.
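To put a rough number on "quickly exceeds" (assuming the pos-bucket tensor is sequence_length × sequence_length, as the square shape suggests): a sequence of just 256 tokens already produces 256 × 256 = 65,536 rows, one more than the 65,535 maximum of the grid's y dimension, so even moderately long inputs trip the limit.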